ABSTRACT

An information system is a chain (or, more generally, a
network) of symbol-processing components, each characterized by
costs and delays, and by the probabilities of its outputs,
given an input. In recent times, statisticians, engineers, and
even philosophers have all shown an increasing tendency to accept
the economist's way of comparing information systems according
to their average costs and benefits, the former depending, in
part, on the delays between the events inquired about and the
actions decided upon.

Statisticians have concentrated on the economic choice of
only two components, the initial and the terminal ones of the
system: "inquiry" and "decision rule". And they have tended
to neglect the processing delays arising in these as well as in
the intermediate components of a system. Engineers, on the
other hand, have concentrated on the intermediate components
that form the "communication sub-chain": "memorizing", "encoding",
"transmitting", "decoding". And they have been concerned with
the processing delays that depend on the average number of code
symbols needed (and thus on the "entropy" to be removed by
communication).

For simplicity, we have assumed that utility (the quantity
whose expected value is maximized by the user) is the difference
between benefits and costs. The current literature on
communication assumes implicitly that other choice criteria
(such as the length of a code word) are also additive, and that
channels with equal capacity are equally costly. These assumptions
may need to be qualified, by studying channel costs and
the economic effects of communication delays.

The economically minded user must consider the several
system components jointly; and it turns out that, in certain
important cases, the average difference between the benefit and
cost to a user is maximized by large-scale demand. Moreover,
the aggregate demand of all users will depend on the joint
supply conditions for the various system components. It will
thus depend, for example, on the cost economies due to the
"packaging" of several components, to standardization and
large-scale production. This opens up the question whether social
interest is best served by a competitive market in information
processing equipment and services, human as well as inanimate.

0. Introduction

0.1 The economist's general information problem. Out
of several pushbuttons, each of a different color, you select
one. A slight push, and massive amounts of energy are released,
and are transformed in the manner you have prescribed.
The button colors which you have perceived and from which you
have selected, exemplify signs, symbols. Your "manipulation
of symbols", equally vaguely called "handling of information",
has involved little energy but has discharged and directed
a large amount. You have done "brain work." No economist
will deny that a large part of our national product is contributed
by symbol manipulation: telephoning orders, discussing
in conferences, shuffling papers, or just performing some of
the humble tasks required of the inspector, or even an ordinary
worker, on the assembly line. */

*/ See Marschak [1938A], a paper addressed to a wider audience
and, in essence, revised here in a somewhat more precise fashion.
For some earlier results see Marschak [1954].
Much is owed to discussions with J. MacQueen. END OF FOOTNOTE.

The economist asks, first: what determines the demand
and supply of the goods and services used to manipulate
symbols? This may help him, second, to understand how social
welfare is affected by the manner in which resources are allocated
to those goods and services.

A prerequisite is to define concepts and study their
interrelations in a way that would prove useful for the
answering of these questions. The economist begins by assuming
that those who demand and use, and those who produce
and supply, the goods and services considered, make choices
that are "economical" (= "rational") in some usefully defined
way, and are made under well-defined constraints. The constraints
may include limitations on the choosers' memories
and other abilities. The economic theorist leaves the door
open to psychologists, sociologists, historians, and to his own
"institutionalist" colleagues in the hope they will help to
determine the values of underlying parameters, provided
(another hope!) they do not establish that the assumption of
"economical" choice fails to yield usefully close approximations
to begin with. I take this back: even then, he will
offer his results as recommendations to users and producers
of "information-handling", or "informational", goods and
services.


0.2 The user's problem, viewed by non-economists.
Besides its interest to economists, the manipulation of symbols,
or information processing, has been the domain of philosophers
and linguists; of computer scientists, control theorists
and communication engineers; and of statisticians. The latter,
following the path of J. Neyman and A. Wald, have become
more and more concerned with the economical manner of obtaining
"information", and have discovered much that is useful
to the economist. Engineers have proposed a measure of
"information transmitted" based on probability relations between
classes of arbitrary signs. This arose out of practical, "economic"
needs of the communication industry. My task will be, in
part, to see how those results fit into the general economics
of symbol-manipulating goods and services, including, for
example, the services of statisticians, and of men who design
or handle computers and control mechanisms. (The task of
the last-named men is indeed to apply economics to to-day's
most varied and complete combinations of informational goods
and services!) Finally, attempts have been made on the part
of philosophers and linguists */ to modify the engineers'

*/ See, e.g., Carnap and Bar-Hillel [1952]; somewhat differently,
Miller and Chomsky [1963]. END OF FOOTNOTE

measure into a "semantic information" or "content" measure,
essentially by substituting for a class of arbitrary signs
its partition into equivalence classes consisting of signs
with identical "content" ("meaning").

In recent years, the approach via economic rationality
(bluntly: via the expected utility to the decision maker) has
begun to penetrate the work of both engineers and philosophers.
An important, though still not sufficiently well known, step
was made by the pioneer C. Shannon himself [1960] when he removed
his earlier tacit assignment of equal penalty for all
communication errors. He introduced, instead, a "fidelity
criterion". This is indeed utility itself, albeit confined
(as we shall see) to the context of communication only and
therefore defined on a very special class of actions and events.
And Ronald A. Howard [1966] writes, in a broader context:

"...The early developers stressed that the information
measure was dependent only on the probabilistic structure
of the communication process. For example, if losing all
your assets in the stock market and having whale steak
for dinner have the same probability, then the information
associated with the occurrence of either event is the same.
...No theory that involves just the probabilities of outcomes
without considering their consequences could possibly
be adequate in describing the importance of uncertainty
to the decision maker."

He concludes his analysis of a neat model with a challenge to
his profession (and perhaps to mine as well):

"If information value and associated decision theoretic
structures do not in the future occupy a large part of the
education of engineers, then the engineering profession
will find that its traditional role of managing scientific
and economic resources for the benefit of man has been
forfeited to another profession."


And philosopher R. Carnap, whom we have mentioned as one
of the early proponents of a "semantic" information measure
("content measure"), wrote in a more recent [1966] paper:

"When I consider the application of the concept of
probability in science then I usually have in mind in
the first place the probability of predictions and only
secondarily the probability of laws or theories.
Once we see clearly which features of prediction are desirable,
then we may say that a given theory is preferable
to another one if the predictions yielded by the
first theory possess on the average more of the desirable
features than the prediction yielded by the other
theory." ...

He then proceeds to show that if "a practically acting man"
"bases his choice either on content measure alone
or on probability alone, he will sometimes be led to
choices that are clearly wrong." "We should choose that
action for which the expectation value of the utility of
outcome is a maximum." (pp. 252, 253-4, 257) */

*/ In the quoted paper, he also says that another paper [Carnap, 1962]
(strongly influenced by Ramsey, De Finetti, and Savage)
"gives an exposition of my view on the nature of inductive
logic which is clearer and from my present point of view more
adequate than that which I gave in my book," viz. in Carnap
[1950].



0.3 Individual demand for information services. Thus
encouraged by the spread of understanding of the economic
approach to information use, I shall proceed with my task, a
more special one than the general economic information problem
outlined at the beginning. I shall study the rational choice-making
of an individual from among available information systems,
or available components of such systems. The availability
constraint specifies, in particular, the costs and the delays
associated with given components, or with chains or networks of
components, of information systems. As is familiar to students
of the market, the available set depends on the choices made by
suppliers. In the last analysis, joint choices by demanders and
suppliers would determine which information systems are in fact
produced and used under given external conditions. These conditions
include the technological knowledge of those concerned.

I shall not be able to make more than casual remarks on
the supply side. The first of the two general questions
to be asked by the economist, the joint determination of demand
and supply, will therefore receive only a partial answer. The
second question, that of socially optimal allocation of resources to
informational goods and services, is pushed away still farther.
This is not to say that the allocation question cannot be
studied till the demand and supply of informational goods and
services is fully understood. Significant work of Hurwicz
[1960], Stigler [1961, 1962], Hirshleifer [1967], Radner [1967,
1968] testifies to the contrary.

1. PROCESSING

1.1 Processing $P$ is defined as

$P = \langle X, Y, \eta, \kappa, \tau \rangle$, where

$X$ = set of inputs $x$
$Y$ = set of outputs $y$
$\eta$ = transformation from $X$ to $Y$, including the case of stochastic
transformation (see below)
$\kappa$ = transformation from $X$ to non-negative reals, measuring
cost (in cost units)
$\tau$ = transformation from $X$ to non-negative reals, measuring
delay (in time units)

$X$, $Y$ are, generally, random sets. As to $\eta$: in a special case called
"deterministic" or "noiseless," $\eta$ is an ordinary function; i.e., it
associates every $x$ in $X$ with a unique $y = \eta(x)$ in $Y$. However,
we must consider the more general case, called "stochastic" or "noisy,"
in which, instead, $\eta$ associates every $x$ in $X$ with some ("conditional")
probability distribution on $Y$. For simplicity of presentation
we shall usually (except for some economically interesting examples)
assume $X$ and $Y$ finite,

$X = (1,\ldots,m)$, $Y = (1,\ldots,n)$,

so that $\eta = [\eta_{ij}]$, where $\eta_{ij} = \mathrm{Prob}(y = j \mid x = i)$. Hence $\eta$ is an
$m \times n$ Markov matrix, i.e., all $\eta_{xy} \ge 0$ and $\sum_y \eta_{xy} = 1$ for all $x$.
But see Blackwell [1953] for an extension of the concept of stochastic
transformation to infinite sets. Clearly, the special, deterministic case
occurs if one element in each row of the matrix $[\eta_{xy}]$ is $= 1$; then we
can write

(1.1.1) $\eta_{xy} = 1$ if $y = \eta(x)$, and $\eta_{xy} = 0$ otherwise.

As to $\kappa$, we shall assume $\kappa(x)$, the cost of processing a given input
$x$, to be constant. We thus forego the discussion of a more general,
stochastic case, in which $\kappa(x)$ is a probability distribution of
costs, given $x$. Similarly, we assume that the time $\tau(x)$
required to process a given input $x$ is constant.
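The finite case of Section 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_markov` and `as_function` and the two example matrices are hypothetical, not taken from the text. The sketch checks the Markov-matrix conditions and recovers the ordinary function $y = \eta(x)$ when the transformation is noiseless.

```python
def is_markov(eta, tol=1e-9):
    """True if every row of eta is a probability distribution over outputs."""
    return all(
        all(p >= -tol for p in row) and abs(sum(row) - 1.0) <= tol
        for row in eta
    )


def as_function(eta, tol=1e-9):
    """For a deterministic (noiseless) eta, return the map x -> y (1-based).

    Returns None when eta is not a Markov matrix or when some row is not
    a unit vector, i.e. when the transformation is noisy.
    """
    if not is_markov(eta, tol):
        return None
    mapping = {}
    for x, row in enumerate(eta, start=1):
        ones = [y for y, p in enumerate(row, start=1) if abs(p - 1.0) <= tol]
        if len(ones) != 1:
            return None
        mapping[x] = ones[0]
    return mapping


# Hypothetical examples with m = n = 2 (illustrative numbers only).
noisy = [[0.9, 0.1], [0.2, 0.8]]
noiseless = [[0.0, 1.0], [1.0, 0.0]]
```

Here `as_function(noiseless)` yields the map from each input to its unique output, while `as_function(noisy)` yields `None`, since no row of the noisy matrix is a unit vector.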

1.2 Cost-relevant inputs. In important cases,
exemplified by processings called "storage" and "transportation," two
otherwise different inputs, $x = i$ and $x = i'$, say, are such that
$\kappa(i) = \kappa(i')$. (It costs the same to transport,
over 100 miles, a gallon of whiskey or of gasoline.) It is then
convenient to replace the original set $X$ by a reduced set $X/\kappa$
consisting of equivalence classes $x/\kappa$ such that all elements of
the same class are associated with the same cost.
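The reduction to cost-equivalence classes can be illustrated as follows. This is a minimal sketch: the helper name `cost_classes`, the third commodity, and all cost figures are hypothetical and serve only to mirror the whiskey and gasoline example.

```python
from collections import defaultdict


def cost_classes(inputs, kappa):
    """Partition inputs into classes of equal processing cost."""
    classes = defaultdict(list)
    for x in inputs:
        classes[kappa[x]].append(x)
    return dict(classes)


# Hypothetical transport costs per gallon over 100 miles (illustrative only).
kappa = {"whiskey": 0.40, "gasoline": 0.40, "nitroglycerin": 3.00}
reduced = cost_classes(kappa.keys(), kappa)
```

The reduced set has one element per distinct cost, so whiskey and gasoline fall into the same class.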

1.3 Available (feasible) processings. For given $X$, $Y$, not all
triples $(\eta, \kappa, \tau)$ are available. For example, to implement a given
transformation $\eta$ at lowered delays $\tau(x)$ for all $x$ may require raised
costs $\kappa(x)$. The set of available processings will be denoted by $\mathcal{P}$.

1.4 Purposive processing. Consider a case in which the $y$ in $Y$
[now to be rewritten as $a$ in $A = (1,\ldots,n)$] can be interpreted as
the actions (decisions) of a person who obeys certain axioms of decision
logic 1/ and the inputs $x$ [now to be rewritten as $z$ in $Z = (1,\ldots,m)$] are
events beyond his control. Then there exists a probability distribution
$\pi$ = vector $[\pi_z]$ and a bounded real-valued "utility function"
$\omega(a, z, \kappa(z), \tau(z))$ such that, given two available processings
$P' = \langle Z', A', \eta', \kappa', \tau' \rangle$, $P'' = \langle Z'', A'', \eta'', \kappa'', \tau'' \rangle$,
the chooser of a processing will choose $P'$ only if

$U_{\pi\omega}(P') \ge U_{\pi\omega}(P'')$,

where, for any processing $P$, its (expected) utility is

(1.4.1) $U_{\pi\omega}(P) = \sum_z \sum_a \pi_z \eta_{za}\, \omega(a, z, \kappa(z), \tau(z))$.

It follows that, given the characteristics of the chooser (viz., $\pi$, $\omega$,
listed in the subscript under $U$ for convenience) and given the available
set $\mathcal{P}$, processing $P^*$ will be chosen only if

$P^* \in \mathcal{P}$, $U_{\pi\omega}(P^*) \ge U_{\pi\omega}(P)$, all $P$ in $\mathcal{P}$.

Note that "chooser" was the word used, instead of "decision-maker":
see also Sec. 2.2 where the chooser of $P$ will be called meta-decider.
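The expected-utility rule of Section 1.4 can be made concrete with a small numerical sketch. Everything in it is assumed for illustration: the function names, the two hypothetical processings "precise" and "cheap", the prior, and a utility function that pays 1 when the action matches the event and subtracts the cost (delays are ignored here).

```python
def expected_utility(pi, eta, kappa, tau, omega):
    """U(P) = sum_z sum_a pi[z] * eta[z][a] * omega(a, z, kappa[z], tau[z])."""
    return sum(
        pi[z] * eta[z][a] * omega(a, z, kappa[z], tau[z])
        for z in range(len(pi))
        for a in range(len(eta[z]))
    )


def choose(available, pi, omega):
    """Return the name of an available processing with maximal expected utility."""
    return max(
        available,
        key=lambda name: expected_utility(pi, *available[name], omega),
    )


def omega(a, z, cost, delay):
    # Hypothetical utility: benefit 1 for a correct action, minus the cost.
    return (1.0 if a == z else 0.0) - cost


pi = [0.5, 0.5]  # assumed prior over two events
available = {
    # name: (eta, kappa, tau), all values illustrative
    "precise": ([[1.0, 0.0], [0.0, 1.0]], [0.30, 0.30], [0, 0]),
    "cheap": ([[0.8, 0.2], [0.2, 0.8]], [0.05, 0.05], [0, 0]),
}
best = choose(available, pi, omega)
```

With these numbers the noisier but cheaper processing has the higher expected utility, which shows why probabilities, costs, and utilities have to be weighed together.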

1.5 Timing. Utility depends on action; accordingly, we consider
that the utility is "earned," and the action $a$ is taken, at the same time.
But the cost $\kappa(z)$ is incurred $\tau(z)$ time units earlier.

1/ I refer to the work of F.P. Ramsey, B. De Finetti, L.J. Savage, accepted
in recent years by professional logicians R. Carnap and R.C. Jeffrey. For
a survey see Marschak [1968B]; also, regarding Carnap and regarding the
relation of probability to frequency see Marschak [1970]. That certain
observed behavior is not really inconsistent with the expected utility rule
if cost or feasibility of storing or other processing is accounted for,
was brilliantly shown by S. Winter [1966]. Among the many merits of Raiffa's
delightful introduction to the field [1966] is his forceful emphasis on the
need for and the possibility of training people for consistency.

1.6 Continued purposive processing. It is often necessary to reinterpret
the output $a$ and input $z$ as time-sequences, with "horizon" $T$
possibly infinite:

(1.6.1) $a = \{a_t\}$, $z = \{z_t\}$, $t = 1,\ldots,T$.


An element of the transformation $\eta$ is then the conditional probability
of a particular sequence of $T$ actions, given a sequence of $T$
events. Applying the results of Koopmans [1960], the utility
$\omega(a, z, \kappa(z), \tau(z))$ entering the definition (1.4.1) of the utility of
processing can be decomposed thus:


(1.6.2) $\omega(a, z, \kappa(z), \tau(z)) = \sum_{t=1}^{T} v(a^t, z^t, \kappa(z^t))\, d^{\tau(z^t)}$


where the "discount constant" $d$ ($0 < d < 1$) and the function $v$ are
independent of time and $a^t$, $z^t$ are, respectively, the "histories up
to $t$":

(1.6.3) $a^t = (a_1,\ldots,a_t)$, $z^t = (z_1,\ldots,z_t)$.

1.7 Additive costs and discounted benefits. A convenient though
rather special assumption is often tacitly made in practice. It is
assumed that, given any distribution $\pi$, the utility of processing,
$U_{\pi\omega}(P)$, increases in the "expected discounted benefit," $B$, and
decreases in "expected cost," $C$. Before defining $B$ and $C$ precisely,
let us state the assumption in two other, obviously equivalent, forms:
for any given $\pi$, (1) "of all processings with the same $C$, the one
with highest $B$ has highest utility"; and (2) "the efficient subset of
$\mathcal{P}$ consists of all those available processings for which the pair $(-C, B)$
is not dominated by any other such available pair."

If, on the other hand, the assumption does not hold, then a processing
may exceed in utility, and hence should be chosen in preference to, another
processing, even though the latter has lower expected cost and higher
expected discounted benefit. It will be shown that the stated tacit
assumption implies that the utility function $\omega$ is decomposable in a
certain sense.
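The "efficient subset" in form (2) is a set of undominated pairs, which can be sketched as a simple filter. The names `dominates` and `efficient` and the three hypothetical processings with their (C, B) values are assumptions made only for illustration.

```python
def dominates(q, p):
    """q = (C, B) dominates p if it costs no more, benefits no less, and differs."""
    return q[0] <= p[0] and q[1] >= p[1] and q != p


def efficient(pairs):
    """Names of processings whose (C, B) pair is not dominated by another pair."""
    return [
        name
        for name, p in pairs.items()
        if not any(dominates(q, p) for q in pairs.values())
    ]


# Hypothetical (expected cost, expected discounted benefit) pairs.
pairs = {"A": (1.0, 5.0), "B": (2.0, 6.0), "C": (2.0, 4.0)}
frontier = efficient(pairs)
```

Processing "C" is dominated by "A" (lower cost and higher benefit), so only "A" and "B" remain; choosing between those two requires the utility function itself.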

More precisely, we define

(1.7.1) $C \equiv C_{\pi}(\kappa) \equiv \sum_z \pi_z \kappa(z)$,

(1.7.2) $B \equiv B_{\pi\beta d}(\eta, \tau) \equiv \sum_z \sum_a \beta(a,z)\, d^{\tau(z)}\, \pi_z \eta_{za}$,

where $\beta$ is the "benefit function" from $Z \times A$ to reals, and $d$ is
the discount constant. [Similar to (1.4.1), subscripts under $C$, $B$ convey
the relevant characteristics of the decision maker.] Note that $d$
occurs in (1.7.2) but not in (1.7.1). This is because of the assumption
on timing in Section 1.5; this difference will be removed when we study
processing chains, as in Section 1.9.

It follows from the general theorem on multi-criterion decisions
(Appendix 1) that $U_{\pi\omega}$ is monotone increasing in $B$ and in $-C$ if
and only if there exists a function $\beta$ and constants $d$ and $k$ such
that

(1.7.3) $\omega(a, z, \kappa(z), \tau(z)) = -k\,\kappa(z) + \beta(a,z)\, d^{\tau(z)}$, $k > 0$

($k$ is a conversion factor, fixing the choice of units). It then follows by
(1.4.1) that the utility of processing is monotone in $-C$ and $B$ (for
all $\pi$), if and only if it is a linear combination,

(1.7.4) $U = -kC + B$, $k > 0$.

(Elsewhere, $B$ was called "expected gross payoff": see Marschak
and Radner [in press]).
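The step from the additive utility (1.7.3) to the linear form (1.7.4) can be checked numerically. The sketch below assumes a hypothetical two-event, two-action example; all numbers and function names are illustrative, and the benefit table is indexed as `beta[a][z]`.

```python
def expected_cost(pi, kappa):
    """C = sum_z pi[z] * kappa[z]."""
    return sum(pi[z] * kappa[z] for z in range(len(pi)))


def expected_discounted_benefit(pi, eta, tau, beta, d):
    """B = sum_z sum_a beta[a][z] * d**tau[z] * pi[z] * eta[z][a]."""
    return sum(
        beta[a][z] * d ** tau[z] * pi[z] * eta[z][a]
        for z in range(len(pi))
        for a in range(len(eta[z]))
    )


def utility_direct(pi, eta, kappa, tau, beta, d, k):
    """Expected utility with omega = -k*kappa(z) + beta(a, z) * d**tau(z)."""
    return sum(
        pi[z] * eta[z][a] * (-k * kappa[z] + beta[a][z] * d ** tau[z])
        for z in range(len(pi))
        for a in range(len(eta[z]))
    )


# Hypothetical data (illustrative only).
pi = [0.6, 0.4]
eta = [[0.9, 0.1], [0.3, 0.7]]
kappa = [1.0, 2.0]
tau = [1, 3]
beta = [[10.0, 0.0], [2.0, 8.0]]  # rows: actions, columns: events
d, k = 0.9, 0.5

C = expected_cost(pi, kappa)
B = expected_discounted_benefit(pi, eta, tau, beta, d)
U = utility_direct(pi, eta, kappa, tau, beta, d, k)
```

The two routes agree, `U` equals `-k * C + B` up to rounding, because each row of `eta` sums to one and the cost term does not depend on the action.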

1.8 Benefit-relevant events and actions. It is convenient to
define $Z$ and $A$ in such a manner that

$\beta(a,z) = \beta(a,z')$, all $a \in A$, only if $z = z'$,

and $\beta(a,z) = \beta(a',z)$, all $z \in Z$, only if $a = a'$.

Thus if $Z$ and $A$ are finite, so that $\beta$ can be represented as a
"benefit matrix"

$\beta = [\beta_{az}]$,

no two columns and no two rows are identical. (Returning to Section 1.2,
we note that $z$ and $z'$ may be equivalent with respect to costs
but not equivalent with respect to benefits.) And no generality
is lost if all the dominated rows are deleted.
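The reduction of a finite benefit matrix described in Section 1.8 can be sketched as follows. The function names and the example matrix are hypothetical; rows stand for actions and columns for events. The sketch only reduces the matrix: in an actual application, the probabilities of events whose columns are merged would have to be added together.

```python
def dominates(r, s):
    """Row r dominates row s: at least as good everywhere, better somewhere."""
    return all(x >= y for x, y in zip(r, s)) and any(x > y for x, y in zip(r, s))


def reduce_benefit_matrix(beta):
    """Drop duplicate rows, dominated rows, and duplicate columns."""
    rows = []
    for r in beta:
        if list(r) not in rows:
            rows.append(list(r))
    rows = [r for r in rows if not any(dominates(s, r) for s in rows)]
    cols = []
    for c in zip(*rows):
        if c not in cols:
            cols.append(c)
    return [list(r) for r in zip(*cols)]


# Hypothetical benefit matrix: 4 actions (rows) by 3 events (columns).
beta = [
    [3, 1, 1],
    [1, 3, 3],
    [0, 0, 0],  # dominated by the first row
    [3, 1, 1],  # duplicate of the first row
]
reduced = reduce_benefit_matrix(beta)
```

After the reduction, two actions and two benefit-relevant events remain, and no two rows or columns coincide.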

1.9 Processing chains. Define a sequence

$P^1, \ldots, P^K$,

where

(1.9.1) $P^k = \langle X^k, X^{k+1}, \eta^k, \kappa^k, \tau^k \rangle$, $k = 1,\ldots,K$.

Let $X^k$ have $n_k$ elements, so that $\eta^k$ is of order $n_k \times n_{k+1}$. Such
a sequence is equivalent to a processing

(1.9.2) $P = \langle X^1, X^{K+1}, \eta, \cdot, \cdot \rangle$,

where

(1.9.3) $\eta = \prod_{k=1}^{K} \eta^k$.

With $P$ in (1.9.2) equivalent to the sequence (1.9.1), the values
achieved by $P$ and by that sequence should be, in a purposive case,
equal. This makes it impossible, in general, to fill the places indicated
by dots in (1.9.2) by single real-valued functions. Rather, the
utility of $P$ (if $P$ is purposive) would depend on the sequences
$\{\kappa^k\}$, $\{\tau^k\}$, $k = 1,\ldots,K$. This is easily seen by applying the
decomposition of utility over time as in (1.6.2) to the case (1.7.3) of
additive costs and benefits.
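The product (1.9.3) can be illustrated with plain nested lists. This is a minimal sketch; `matmul`, `chain`, and the two link matrices are hypothetical names and numbers chosen only to show that the product of Markov matrices is again a Markov matrix.

```python
from functools import reduce


def matmul(p, q):
    """Product of an (n1 x n2) and an (n2 x n3) matrix given as nested lists."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]


def chain(links):
    """Transformation of the whole chain: the product of its links' matrices."""
    return reduce(matmul, links)


# Two hypothetical links, e.g. a noisy inquiry followed by a randomized rule.
eta1 = [[0.9, 0.1], [0.2, 0.8]]
eta2 = [[1.0, 0.0], [0.5, 0.5]]
eta = chain([eta1, eta2])
```

Each row of `eta` still sums to one, so the chain as a whole is a stochastic transformation from the first input set to the last output set.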

1.10 Networks. More general than a chain is a network, in which
each transformation may have several input and output variables, some
possibly shared with other transformations. We shall not pursue this
here. See Marschak and Radner [in press], Chapter 8.


2. SYMBOLS AS OUTPUTS AND INPUTS

2.1 A purposive processing chain. Consider a chain (1.9.1) consisting
of $K$ successive processing links, with

$X^{K+1}$ = set $A$ of actions $a$,

$X^1$ = set $Z$ of events $z$,

where $a$ and $z$ are typical arguments of the benefit function $\beta(a,z)$.
Each may be a time-sequence as in (1.6.1). Some physical processes
cause an action and event to jointly yield some physical consequence
(again possibly a time-sequence: e.g., a sequence of annual monetary
profits), to which a benefit number is attached. But we shall not be
concerned with these physical processes, and the chains (or networks) that
they form.

The inputs and outputs of the intermediate processing links,
$P^2,\ldots,P^{K-1}$, do not enter the benefit function. As in Section 1.2, two
elements $x_1^k$, $x_2^k$ of the set $X^k$, $k = 2,\ldots,K$, can be considered
equivalent if their processing costs are equal:

$\kappa^k(x_1^k) = \kappa^k(x_2^k)$.

It will be convenient to reserve the term "symbols" for these "benefit-neutral"
but "cost-relevant" inputs and outputs. Thus the
links $P^2,\ldots,P^{K-1}$ will be said to process symbols onto symbols. Typical
examples are: translation (e.g., encoding, decoding) of messages; transmission
of messages over distances; and their storage over time. On
the other hand, an event or an action (even that of a painter or composer)
will not be called a symbol; but processing link $P^1$ will be said to
transform event into a symbol; and $P^K$ will be said to transform a
symbol into action.

2.2 Choosing the chain: a meta-decision. The action, or decision,
$a \in X^{K+1}$, the output of the last link in the purposive chain, must be
distinguished from the decision to choose one rather than another chain.
The difference between expected benefit and cost is maximized by the
chooser of the chain. The chooser may hire men or machines to perform the
successive processings, including the ultimate one, viz., the choice of
action, or decision. If this ultimate processing link is called deciding,
the choice of it and of other links of the chain may be called meta-deciding.

2.3 Some information systems. A purposive processing chain is
often called an information system, the word information presumably
bearing some relation to transformations from and into symbol sets.
Information about a physical fact is not the fact itself but some "symbols"
(e.g., words) associated with it. Historically, two kinds of
"shortened" chains have been considered by specialists: statisticians
on the one hand, and communication engineers on the other. They are

(a) a two-link chain, with

    $X^1$ = $Z$ = events                       $P^1$ = experiment, inquiry
    $X^2$ = $Y$ = data, observations           $P^2$ = strategy
    $X^3$ = $A$ = actions, decisions

(b) a four-link chain, with

    $X^1$ = $Z$ = messages to be sent          $P^1$ = storing
    $X^2$ = long stored sequences of messages  $P^2$ = encoding
    $X^3$ = encoded messages                   $P^3$ = transmission
    $X^4$ = received messages                  $P^4$ = decoding
    $X^5$ = $A$ = decoded messages

The two chains (a) and (b) are linked together in the figure.

To suit special applications, some special assumptions are usually
made, different in (a) and in (b), regarding the sets of inputs and outputs,
the sets of available processings, the cost and delay functions $\kappa$ and
$\tau$, and the benefit function $\beta$. We shall indicate some of those
assumptions and the implications of removing them, in due course.

Both (a) and (b) can be considered as special cases of some longer
chain. It seems that such longer chains are necessary to describe, in
their full richness, the operations of a computer (including problem-solving,
simulation, pattern recognition, etc.). The popular description
of these operations as "information processing" would then appear
a felicitous one. This would include, for example, programmed navigation.
See Chernoff [19>1].

In the following three Sections 3, 4, 5, we deal with the two-link
chain (a), and study the consequences of some simplifying assumptions
used, in effect, in the literature of "statistical decision theory."
These results are, in fact, applicable also to information systems
consisting of any number of links, with actions based, not directly on
observations (outputs of the "inquiry" link), but on the outputs of subsequent
processings (e.g., encoding, transmitting) of observations. By
(1.9.3), the system's transformation matrix $\eta$ is the product of the
successive transformation matrices, $\eta^k$, of its links; and the latter
need not be specified if the assumptions listed in Section 3 are made.
Accordingly, in the next three sections, $\eta$ will be called, interchangeably,
the inquiry matrix or, to be more general, the transformation
matrix of an information system or, briefly, information matrix.

2.4 Of the assumptions listed in Section 3 that of additive cost is
perhaps least offensive and is, at the same time, fruitful of important
results, for it permits us to concentrate on the properties of the information
matrix $\eta$. On the other hand, the question of successive delays
(operation speeds and capacities at successive links), mostly neglected
in the two-link theory and introduced in our Section 4 in general terms
only, will become a serious one when the processing chain is lengthened
by inserting links that implement the communication between the observer
and the decision-maker.

3. INQUIRING AND DECIDING IN STATISTICAL THEORY

3.1 The two-link chain. Link $P^1$ in the two-link chain (a) of
Section 2.3 has been variously called "experiment," "taking observations,"
also "making a diagnosis." Link $P^2$, "strategy," has been
also called "decision rule." Reflecting certain though surely not all
aspects of statistical practice, the usual analysis of the two-link
chain tacitly makes some restrictions which do not appear necessary or
justifiable in the broader context of economic comparison of purposive
processings. In particular, the delays $\tau^1(z)$, $\tau^2(y)$ are neglected;
and so are the constraints on strategies, and their cost, $\kappa^2(y)$.

On the other hand, in most statistical writings, our environmental
variable $z$ is generalized, as follows. The event (or, in the case of
continued processing, a time-sequence of events) is replaced by a probability
distribution called "hypothesis," so that our $\pi$ becomes a distribution on the
space of probability distributions of some variable $v$. However, this
complicated description of the problem is equivalent, and can be reduced,
to the original problem, with $v$ playing the role of the event $z$. We
shall, therefore, not pursue this further.

3.2 Neglecting delays. While, as will be shown, the speed of processing
is attached great importance in the existing work of communication
engineers who study the several-links chain (b) described in
Sec. 2.3, processing speed is completely neglected in the statistical
theory of the two-link chain (a). No explicit attention is paid to
whether it takes an hour or a month to collect a sample, or to apply a
given decision rule. Accordingly the question of "overloaded capacity"
of an observation equipment or decision-making equipment is not, to my
knowledge, treated explicitly in statistical literature. It is assumed
in effect that for all processing chains considered, $\tau^1(z)$ is the same
constant, and $\tau^2(y)$ is the same constant; so that, when comparing
the values of two processings, one can assume for both, without loss
of generality,

(3.2.1) $\tau^1(z) = \tau^2(y) = 0$.

Ho  doubt  this  assumption  is  not  made  in  actual  statistical  practice. 

If  the  expected  benefit  can  be 

strongly  diminished  when  decisions  are  based,  on  obsolete  data  (see 

Section  5)j  the  chooser  of  the  experiment  ar.a  the  strategy  will 

give  preference  to  accelerated  ones,  costs  permitting.  Moreover,  it  is 

this 

not  economical  to  accelerate  the  erceriment  if  /  results  in  piling  up 

unused  data  because  decisions  are  taken  too  slovly.  ruck  considerations 

surely  arise  in  industrial  quality  control,  in  marketing 

research,  in  the  preparation  of  economic  indices  for  public  policy,  and, 

*/ 

very  likely  also  in  rrtch  of  scientific  laboratory  and  clinical  work.—* 

*/ See, for example, N.O. Anderson [1969].

3.3 With delays out of the way, the "Statistical Decision Problem" takes the following form. Changing notations somewhat, write:

(3.3.1)  $\eta = p(y|z) = [\eta_{zy}]$;  $\gamma^1(z) = \gamma_z$;  $P^1 = \langle \eta, \gamma \rangle$;
         $\alpha = p(a|y) = [\alpha_{ya}]$;  $\gamma^2(y) = \delta_y$;  $P^2 = \langle \alpha, \delta \rangle$.

The sets $Z$ and $A$ are regarded as fixed; this and the fact that $Y$ is the range of $\eta$ and the domain of $\alpha$ justifies the above abbreviated definition of the links $P^1$, $P^2$. Then the processing chain $(P^1, P^2)$, if available, can be written as

(3.3.2)  $P \equiv (\eta, \alpha, \gamma, \delta) \in \mathcal{P}$,

where $\mathcal{P}$ is the feasible set. We assume additive cost as in Sec. 1.7 but postpone till later (Sec. 5) the consideration of continued processing introduced in Sec. 1.6. The chooser then maximizes, subject to the constraint (3.3.2), the expected utility $U$, which is the difference between expected benefit $B$ (no discounting for delay need be considered) and expected cost $C$, where

(3.3.3)  $B = B_{\pi\beta}(P) = \sum_z \sum_y \sum_a \beta(a,z)\, \pi_z \eta_{zy} \alpha_{ya}$,

(3.3.4)  $C = C_{\pi}(P) = \sum_z \pi_z \gamma_z + \sum_z \sum_y \pi_z \eta_{zy} \delta_y$,

(3.3.5)  $U = U_{\pi\beta}(P) = B_{\pi\beta}(P) - C_{\pi}(P)$.

As in Section 1.7, the subscripts under $B$, $C$, $U$ characterize the chooser. Together with the feasible set $\mathcal{P}$, they form the givens of the chooser's problem. Hence the optimal chain $P^*$ is a function of $\pi$, $\beta$. So is the efficient set, which consists of all elements of $\mathcal{P}$ for which the pair $(-C, B)$ is not dominated by any other such feasible pair.

3.4 Action as a subset of events. In general, there is no need to assume any formal, logical relation between $Z$ and $A$. For example, $Z$ may be the set (cancer, no cancer), and $A$ may be the set (surgery, radiotherapy, no treatment). The benefit function $\beta$ would then assign a value to each of the 2 x 3 = 6 pairs $(a,z)$. In statistical literature, an action that can be considered relevant to the benefit of the statistician's "employer," can be identified with the choice of one of disjoint subsets ("alternative hypotheses") of the set of benefit-relevant events. Such actions cannot be more numerous than events. True, the action of the statistician is, in other cases, said to consist in choosing from a set of overlapping subsets of events: e.g., in naming an interval.*/ He is then supposed to use choice criteria relevant, I think, to his own, not his employer's, benefit. It is difficult to see how, for example, the length of a confidence interval in a market prediction affects the seller's profit, given the state of the market.


For purposes of economics of information, it is more useful to say that the statistician's task is to derive, from observations $y$, the likelihoods $\eta_{zy}$ for all events relevant to his employer's benefit. Given the prior probabilities $\pi_z$, one can then determine the joint probabilities $\pi_z \eta_{zy}$ or, for that matter, the posterior probabilities $\pi_z \eta_{zy} / \sum_t \pi_t \eta_{ty}$. The employer or his operations research man (possibly identical with the statistician) will combine these probabilities with the benefits yielded to the employer by his actions, given the events, and choose the action that maximizes expected benefit.

Accordingly, we shall permit the employer's (user's) actions to be more numerous than events. This will lead to interesting results in the economics of comparing information systems: see Section 6.5.
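
The division of labor described above can be illustrated by a minimal sketch in Python. All numbers are hypothetical and are not taken from the paper, and the function names `posterior` and `best_action` are introduced here only for illustration. The statistician supplies likelihoods for one observation; the employer combines the resulting posterior with his own benefits over three actions and two events.

```python
def posterior(prior, likelihood_column):
    """Posterior p(z|y) from prior pi_z and likelihoods eta_zy for one observed y."""
    joint = [p * l for p, l in zip(prior, likelihood_column)]
    total = sum(joint)
    return [j / total for j in joint]


def best_action(post, benefit):
    """Return (index of best action, expected benefits of all actions)."""
    expected = [sum(b * p for b, p in zip(row, post)) for row in benefit]
    best = max(range(len(benefit)), key=lambda a: expected[a])
    return best, expected


# Hypothetical inputs (assumptions for illustration only).
prior = [0.1, 0.9]                 # events: cancer, no cancer
likelihood_positive = [0.9, 0.2]   # eta_zy for the observation y = "positive test"
benefit = [                        # rows: actions; columns: events
    [10.0, -4.0],                  # surgery
    [6.0, -1.0],                   # radiotherapy
    [-10.0, 0.0],                  # no treatment
]

post = posterior(prior, likelihood_positive)
action, expected = best_action(post, benefit)
print(post, action, expected)
```

With these particular numbers the intermediate action (radiotherapy) maximizes expected benefit, which is one way to see why actions more numerous than events can matter to the user.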


*/ Contrast Examples 1-3 with Example 4 in Lehmann [1959], Section 1.2. See also Pratt [1961]. I am indebted to W. Kruskal for discussions of this question. END OF FOOTNOTE.

To be sure, a problem of communication arises. It is, in fact, the problem of optimal encoding, in the sense of our Section 7, below. It may be costly or even non-feasible to communicate in all detail the posterior or the joint probability distributions involved, to the employer, or to his operations research man, or to a low-echelon decision-making man or machine. With this in mind, a condensed message may be used: for example, the posterior probability that $z$ lies in a particular interval. The choice of the interval will then depend, not on the statistician's "tastes", but on the "meta-decider's" judgment as to the contributions of alternative codes to his benefit and cost.

3.5 Neglecting the constraints and costs of deciding. In important parts of statistical literature decision-making is, in effect, assumed costless and unconstrained. This strong assumption has led to a fruitful discussion of "comparative informativeness" of the matrices $\eta = [\eta_{zy}]$. We shall pursue it in some detail in Sections 4 and 5.

The assumption of costless and unconstrained deciding is too strong to have been actually accepted in practice. For example, in the case where observations $y$ and decisions $a = \alpha(y)$ are both real-valued, attention was paid, quite early, to a special class of decision rules, viz., to the class of linear $\alpha$, presumably because linear functions require less computational effort. (The theorem that, among unbiased linear estimators, the least-squares estimator is best goes back to Gauss, I understand.) The search for good "robust" statistics is also due to considerations of computational economy, I suppose; as is, of course, the rounding-off of digits in the computational process.

3.6 Value of information. With decision undelayed, costless and unconstrained, and inquiry undelayed, the problem of the chooser of a two-link chain $P$ is simplified. Denote by $\{\alpha\}$ the set of all stochastic transformations from $Y$ to $A$ (any such transformation is feasible) and let $\{(\gamma,\eta)\}$ be the set of feasible pairs of inquiry transformations $\eta$ and inquiry cost functions $\gamma$. Then the constraint (3.3.2) is relaxed into

(3.6.1)  $P = (\eta, \alpha, \gamma) \in \{\alpha\} \times \{(\gamma,\eta)\}$,

since $\delta = 0$. Further, equation (3.3.3) is unaffected, but in (3.3.4) the term involving $\delta$ vanishes. Therefore, (3.3.5) can be rewritten as

(3.6.2)  $U = U_{\pi\beta}(\eta, \alpha, \gamma) = B_{\pi\beta}(\eta, \alpha) - C_{\pi}(\gamma)$,

where

(3.6.3)  $B_{\pi\beta}(\eta, \alpha) = \sum_z \sum_y \sum_a \beta(a,z)\, \pi_z \eta_{zy} \alpha_{ya}$,

(3.6.4)  $C_{\pi}(\gamma) = \sum_z \pi_z \gamma_z$.

Define the "information value" of $\eta$:

(3.6.5)  $V_{\pi\beta}(\eta) \equiv \max_{\alpha} B_{\pi\beta}(\eta, \alpha) \equiv B_{\pi\beta}(\eta, \alpha^*)$;

then, to maximize expected utility $U$ with respect to $\eta$, $\alpha$, $\gamma$ over their feasible set, given $\pi$, $\beta$, is equivalent to

(3.6.6)  $\max_{\eta} V_{\pi\beta}(\eta) - \min_{\gamma} C_{\pi}(\gamma)$,

subject to the cost constraint

(3.6.7)  $(\gamma, \eta) \in \{(\gamma,\eta)\}$.

With the meta-decider's problem reduced to (3.6.6), (3.6.7) it is useful to consider the expected cost $C_{\pi}(\gamma)$ as fixed and to compare various information matrices $\eta$, $\eta'$, ... according to their values $V_{\pi\beta}(\eta)$, $V_{\pi\beta}(\eta')$, ... .

3.7 Appropriate action, $a_y$; value of observation, $V_y$. The optimal decision rule $\alpha^*$ defined in (3.6.5) depends only on $\eta$ and $\pi$, $\beta$:

(3.7.1)  $\alpha^* = \alpha^*_{\pi\beta}(\eta)$;

now, for each $\eta$, given $\pi$ and $\beta$, there will exist a deterministic optimal decision rule; for it is well-known that, in a one-person game, there exists a pure optimal strategy. Thus, no generality is lost if we define $\{\alpha\}$ as the set of all mappings from $Y$ to $A$. The assumption of costless and non-restricted decisions excludes the case when the hired (and presumably cheap) decision-making man or machine uses a non-optimal deterministic rule; and also the case when he (it) makes "random errors," unless they happen to constitute an optimal random strategy.

With $\{\alpha\}$ reduced to the set of all pure strategies, i.e., all functions $\alpha$ from $Y$ to $A$, we can write $a = \alpha(y)$ so that [similar to (1.1.1)]

$\alpha_{ya} = 1$ if $a = \alpha(y)$, and $\alpha_{ya} = 0$ otherwise,

and denote the action that is "appropriate" (i.e., optimal) in response to $y$ by

$a_y = \alpha^*(y)$;

that is, for a given $y$,

(3.7.2)  $\max_{a \in A} \sum_z \beta(a,z)\, \pi_z \eta_{zy} \equiv \sum_z \beta(a_y,z)\, \pi_z \eta_{zy} \equiv V_y$,

say. $V_y$ may be called "value of the observation $y$." It follows by (3.6.3) and (3.6.5) that the value $V$ of an inquiry is the sum of the $V_y$; for

(3.7.3)  $V = \max_{\alpha} \sum_z \sum_y \beta(\alpha(y),z)\, \pi_z \eta_{zy}$,

(3.7.4)  $V = \sum_y \max_a \sum_z \beta(a,z)\, \pi_z \eta_{zy}$,

(3.7.5)  $V = \sum_y V_y$.

We shall write $Z = (1, \dots, m)$, $Y = (1, \dots, n)$; hence $\eta$ is of order $m \times n$.
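
The decomposition of the information value into observation values, (3.7.2) to (3.7.5), can be sketched numerically. The following is only an illustration under assumed inputs: the prior, the matrix `eta`, and the benefit table `beta` are hypothetical, and the function names are not from the paper.

```python
def observation_values(prior, eta, beta):
    """Return a list of (a_y, V_y), one pair per observation y.

    prior[z] = pi_z, eta[z][y] = eta_zy, beta[a][z] = beta(a, z).
    """
    m, n = len(eta), len(eta[0])
    result = []
    for y in range(n):
        # Expected benefit of each action, weighted by the joint pi_z * eta_zy.
        scores = [sum(beta[a][z] * prior[z] * eta[z][y] for z in range(m))
                  for a in range(len(beta))]
        a_y = max(range(len(beta)), key=lambda a: scores[a])
        result.append((a_y, scores[a_y]))
    return result


def information_value(prior, eta, beta):
    """V = sum over y of V_y, as in (3.7.5)."""
    return sum(v for _, v in observation_values(prior, eta, beta))


# Hypothetical two-event, two-observation, two-action example.
prior = [0.5, 0.5]
eta = [[0.8, 0.2],
       [0.3, 0.7]]
beta = [[1.0, 0.0],   # action 0 pays 1 if the event is z = 0
        [0.0, 1.0]]   # action 1 pays 1 if the event is z = 1

per_obs = observation_values(prior, eta, beta)
V = information_value(prior, eta, beta)
print(per_obs, V)
```

The maximization is carried out separately for each observation, which is the content of the step from (3.7.3) to (3.7.4).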

3.8 Labelling of observations. It is clear from (3.6.3), (3.6.5) that $V(\eta)$ is invariant under interchange of columns in $\eta$. Therefore, if $\eta$ is of order $m \times n$ and $P$ a permutation matrix of order $n$, we shall agree that

(3.8.1)  $\eta$ and $\eta P$ are equivalent.

Thus if (with $m = n = 2$), $z = 1$ means "stock will rise" and $z = 2$ means "stock will not rise," then the datum "my broker says stock will rise" can be labelled, indifferently, as $y = 1$ or as $y = 2$. There is no loss of generality in choosing any one particular labelling.

Also, no generality is lost if we agree to eliminate any column of $\eta$ that consists of 0's only, and thus designates (with $Y$ finite, as we recall!) an observation that never occurs.

It is seen from (3.7.2) that two observations $y = j, k$, whose conditional probabilities, given any event $z$, are pairwise equal, yield the same appropriate action $a_j = a_k$ and the same value $V_j = V_k$. It is convenient therefore, and involves no loss of generality, to redefine every such inquiry by adding any two identical columns, and thus to make every inquiry matrix $\eta$ consist of non-identical columns only.
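
The two conventions of this Section (relabelling observations, and adding identical columns) can be checked on a small case. This is a sketch with hypothetical numbers; `permute_columns` and `merge_columns` are illustrative helper names, not notation from the paper.

```python
def information_value(prior, eta, beta):
    """V = sum_y max_a sum_z beta[a][z] * prior[z] * eta[z][y], as in (3.7.4)."""
    m, n = len(eta), len(eta[0])
    return sum(
        max(sum(b[z] * prior[z] * eta[z][y] for z in range(m)) for b in beta)
        for y in range(n)
    )


def permute_columns(eta, order):
    """Relabel observations: new column i is old column order[i]."""
    return [[row[y] for y in order] for row in eta]


def merge_columns(eta, j, k):
    """Add column k to column j and drop column k."""
    merged = []
    for row in eta:
        new_row = list(row)
        new_row[j] += new_row[k]
        del new_row[k]
        merged.append(new_row)
    return merged


# Hypothetical inquiry whose first two columns are identical.
prior = [0.5, 0.5]
beta = [[1.0, 0.0],
        [0.0, 1.0]]
eta = [[0.4, 0.4, 0.2],
       [0.1, 0.1, 0.8]]

v_original = information_value(prior, eta, beta)
v_permuted = information_value(prior, permute_columns(eta, [2, 0, 1]), beta)
v_merged = information_value(prior, merge_columns(eta, 0, 1), beta)
print(v_original, v_permuted, v_merged)
```

All three values coincide here, as the text asserts for any prior and benefit function.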

3.9 Null-information is said to be provided by any matrix $\eta$ whose rows are identical, so that we can write

$\eta_{zy} = \lambda_y$, all $z$;  $\sum_{y=1}^{n} \lambda_y = 1$.

Then by (3.7.4)

$V = \sum_y \lambda_y \max_a \sum_z \beta(a,z)\, \pi_z$,

(3.9.1)  $V = 1 \cdot \max_a \sum_z \beta(a,z)\, \pi_z$,

so that $V$ is independent of $\eta$. Thus all null-information inquiries have the same value. As their canonical form we can conveniently choose the $(m \times 1)$ matrix with all elements $\eta_{z1} = 1$, $z = 1, \dots, m$. That is, the same unique observation is obtained, with certainty, whatever the event. Then $\eta$ is the column vector of order $m$, with all elements = 1: a "sum vector," sometimes denoted by

$\eta = \mathbf{1}$.
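
As a quick numerical check of (3.9.1), the sketch below compares an inquiry with identical rows against the canonical sum vector. The prior, the benefits, and the row $(0.3, 0.7)$ are assumed values chosen only for illustration.

```python
def information_value(prior, eta, beta):
    """V = sum_y max_a sum_z beta[a][z] * prior[z] * eta[z][y], as in (3.7.4)."""
    m, n = len(eta), len(eta[0])
    return sum(
        max(sum(b[z] * prior[z] * eta[z][y] for z in range(m)) for b in beta)
        for y in range(n)
    )


prior = [0.6, 0.4]
beta = [[1.0, 0.0],
        [0.0, 1.0]]

# Identical rows: the observation is independent of the event.
eta_null = [[0.3, 0.7],
            [0.3, 0.7]]
# Canonical form: a single observation obtained with certainty.
eta_canonical = [[1.0],
                 [1.0]]

v_null = information_value(prior, eta_null, beta)
v_canonical = information_value(prior, eta_canonical, beta)
# Right-hand side of (3.9.1): the best action chosen on the prior alone.
v_prior_only = max(sum(b[z] * prior[z] for z in range(len(prior))) for b in beta)
print(v_null, v_canonical, v_prior_only)
```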

3.10 Essential set of inquiry matrices. Let $\{\eta\}$ be the set of all Markov matrices with $m$ rows and with all columns non-zero and not pairwise identical. Summarizing the conventions just made, the essential set $H_m$ of inquiries about $m$ events is defined as the partition $\{\eta\}/e$ into equivalence classes; where $\eta$ and $\eta'$ in $\{\eta\}$ are equivalent, $\eta = [\eta_{zy}] \; e \; \eta' = [\eta'_{zy'}]$, if $\eta' = \eta P$ for some permutation matrix $P$ or if every $\eta_{zy}$, $\eta'_{zy'}$ is independent of $z$.

3.11 Perfect information will be said to be provided by a matrix $\eta$ of order $m \times m$ such that the correspondence between $Z$ and $Y$ is one-to-one. That is, one element in each row of $\eta$ is = 1 (and hence the other elements in the row are = 0), so that $\eta$ is noiseless, as in (1.1.1); and, moreover, in each column one element is = 1 and all other elements are = 0. Thus $\eta$ is a permutation matrix, $\eta = Q$, say. Its transpose $Q^T$ is clearly a permutation matrix, too, $Q^T = P$, say; and it is well known that

$I = QP$,

where $I$ is the identity matrix. Then by (3.8.1), $I$ and $Q$ will be considered equivalent: without loss of generality, perfect information will be represented by the identity matrix $I$ as its canonical form.

3.12 Informativeness and optimality of inquiry. In Section 4, a strong partially ordering relation called "more informative than" will be introduced on the essential set $H_m$ of information matrices. This relation is of general significance as it is independent of $\pi$ and $\beta$ and is in this sense common to all users (meta-deciders). Some applications to delayed processings will be made in Section 5, still focussing on values $V$ only, by considering expected costs $C$ as given. In Section 6, $C$ will be permitted to vary, to analyze optimality conditions in greater generality.

3.13 Useless inquiries. It will be seen in Section 4.3 that, for any $\pi$, $\beta$, the value of $\eta$ cannot be smaller than the value common to all null-information inquiries, given in (3.9.1). An inquiry will be called useless with respect to $\pi$, $\beta$ if its value, as defined in (3.6.5), is equal to the value of a null-inquiry. Thus all null-inquiries are useless. But (as will be shown on an example in Section 6.4), the converse is, in general, not true.

3.14 The information value $V_{\pi\beta}(\eta)$ is a convex function of $\eta$. For, by (3.6.3), the benefit $B_{\pi\beta}(\eta,\alpha)$ is linear in the elements $\eta_{zy}$ of $\eta$. Hence, for $\pi$, $\beta$ given, all benefit functions constitute a family of (weakly) convex functions of $\eta$. It follows by (3.6.5) and a well-known theorem (see e.g., Karlin [1959], Appendix B.4), that the information value

$V_{\pi\beta}(\eta) = \max_{\alpha} B_{\pi\beta}(\eta, \alpha)$

is a convex function of $\eta$; it is represented by the upper envelope of a family of hyperplanes. The same is true of $V_y$ in (3.7.2).
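
Convexity can be illustrated on one deliberately extreme case, sketched below with assumed inputs: the equal-weight mixture of two perfect inquiries (the identity matrix and its column permutation) is a null inquiry, so the value of the mixture falls below the mixture of the values. The helper `mix` is a hypothetical name.

```python
def information_value(prior, eta, beta):
    """V = sum_y max_a sum_z beta[a][z] * prior[z] * eta[z][y], as in (3.7.4)."""
    m, n = len(eta), len(eta[0])
    return sum(
        max(sum(b[z] * prior[z] * eta[z][y] for z in range(m)) for b in beta)
        for y in range(n)
    )


def mix(eta_a, eta_b, t):
    """Elementwise convex combination t * eta_a + (1 - t) * eta_b."""
    return [[t * x + (1.0 - t) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(eta_a, eta_b)]


prior = [0.5, 0.5]
beta = [[1.0, 0.0],
        [0.0, 1.0]]
eta_a = [[1.0, 0.0],
         [0.0, 1.0]]   # perfect information
eta_b = [[0.0, 1.0],
         [1.0, 0.0]]   # perfect information, observations relabelled

t = 0.5
v_mix = information_value(prior, mix(eta_a, eta_b, t), beta)
v_bound = t * information_value(prior, eta_a, beta) \
    + (1.0 - t) * information_value(prior, eta_b, beta)
print(v_mix, v_bound)   # convexity requires v_mix <= v_bound
```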

3.15 The case of smooth benefit functions. Suppose the set $A$ of actions is non-countable, and the benefit function $\beta(a,z)$ is twice differentiable with respect to $a$. Then the observation value $V_y$ and the information value $V$ are continuously differentiable in the elements $\eta_{zy}$ of $\eta$. Moreover it can be conjectured [by extending the reasoning that follows equation (6.5.10)] that in that case all useless inquiries are null-inquiries if $A$ is unbounded and there are only two benefit-relevant events.

1/ Acknowledgments to a suggestion of M. Pham-Huu-Tri.

4. COMPARATIVE INFORMATIVENESS

4.1 Definition. We say, following Blackwell,1/ that $\eta$ is more informative 2/ than $\eta'$, and write $\eta > \eta'$, if and only if

$V_{\pi\beta}(\eta) \geq V_{\pi\beta}(\eta')$ for all $\pi$, $\beta$,

where $\pi$, $\beta$ are defined on fixed sets $Z$ and $A \times Z$, respectively. By fixing these sets rich enough, we can apply the definition of "more informative than" to an arbitrarily large set of meta-deciders concerned with the choice among inquiry matrices, provided the expected cost of information is kept constant.

Clearly ">" is a transitive and reflexive relation, and thus induces an ordering on the set of information matrices. It is a partial ordering on this set: for it is easy to construct cases when, depending on $\pi$, $\beta$, the information matrix $\eta$ has a larger or a smaller value than $\eta'$. Clearly the relation ">" induces also a partial ordering on the essential set $\{\eta\}/e$, defined in Section 3.10. In particular, when $\eta \; e \; \eta'$ then obviously both $\eta > \eta'$ and $\eta' > \eta$. We shall show in Section 4.7 that the converse is also true, so that the partial ordering on the essential set of information matrices by the relation ">" is a strong one.

1/ Several papers by Blackwell and also some earlier work by Bohnenblust, Shapley and Sherman are summarized, as far as "informativeness" is concerned, in Chapter 12 of Blackwell and Girshick [1954]. See also Marschak and Miyasawa [1968].

2/ The "more" (rather than "not less") and the sign ">" (rather than "$\geq$") should not confuse. Blackwell's notation has the advantage of reserving the sign "=" (usually equivalent to "$\geq$ and $\leq$") for the case of identity. The same would be achieved by symbols "$\succsim$" and "$\sim$" used in the economics of preference.

4.2 Garbling. Consider an information matrix $\eta = [\eta_{zy}]$ and suppose that, whenever the observation $y$ (= 1, ..., $n$) is made, the decision-maker does not learn it; instead, a random device is used such that, given the observation $y$, he will receive, with probability $g_{yy'}$, a signal $y' = 1, \dots, n'$. Clearly $g_{yy'} \geq 0$, $\sum_{y'} g_{yy'} = 1$. The random device is thus characterized by a Markov matrix $G = [g_{yy'}]$ of order $n \times n'$. It follows that, given the event $z = 1, \dots, m$, the decision-maker receives signal $y'$ with probability

(4.2.1)  $\sum_y \eta_{zy} g_{yy'} \equiv \eta'_{zy'}$,

say, where $\eta'_{zy'} \geq 0$, $\sum_{y'} \eta'_{zy'} = 1$. In effect, he has used an information matrix $\eta' = [\eta'_{zy'}]$ of order $m \times n'$ such that

(4.2.2)  $\eta' = \eta G$.

It seems to agree with common usage to say that $\eta'$ is obtained from $\eta$ by garbling. And it is intuitively clear that a garbled information matrix cannot exceed in value the original one; for the decision-maker receiving a "garbled" signal will, at best, choose an action appropriate to that signal, not to the original observation. Formally, we have

Theorem: If $\eta$, $\eta'$, $G$ are Markov matrices with $\eta' = \eta G$, then $\eta > \eta'$.

Proof: By (3.7.2), (3.7.5), (4.2.1),

$V(\eta') = \sum_{y'} \sum_z \beta(a_{y'},z)\, \pi_z \eta'_{zy'}$
$\quad = \sum_y \sum_{y'} g_{yy'} \sum_z \beta(a_{y'},z)\, \pi_z \eta_{zy}$
$\quad \leq \sum_y \sum_{y'} g_{yy'} \sum_z \beta(a_y,z)\, \pi_z \eta_{zy}$
$\quad = 1 \cdot \sum_y \sum_z \beta(a_y,z)\, \pi_z \eta_{zy}$
$\quad = \sum_y \max_a \sum_z \beta(a,z)\, \pi_z \eta_{zy} = V(\eta)$,

by (3.7.4).
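
The garbling theorem can be probed numerically. The sketch below is not a proof: it builds $\eta' = \eta G$ for one assumed pair $\eta$, $G$ and then draws random priors and benefit tables (three hypothetical actions) to confirm that the value gap never turns negative. All numbers and the name `garble` are assumptions for illustration.

```python
import random


def information_value(prior, eta, beta):
    """V = sum_y max_a sum_z beta[a][z] * prior[z] * eta[z][y], as in (3.7.4)."""
    m, n = len(eta), len(eta[0])
    return sum(
        max(sum(b[z] * prior[z] * eta[z][y] for z in range(m)) for b in beta)
        for y in range(n)
    )


def garble(eta, g):
    """Matrix product eta' = eta G, as in (4.2.2)."""
    return [[sum(row[y] * g[y][yp] for y in range(len(g)))
             for yp in range(len(g[0]))] for row in eta]


eta = [[0.8, 0.2],
       [0.3, 0.7]]
G = [[0.9, 0.1],
     [0.2, 0.8]]
eta_garbled = garble(eta, G)

rng = random.Random(0)          # fixed seed so the run is repeatable
worst_gap = float("inf")
for _ in range(1000):
    p = rng.random()
    prior = [p, 1.0 - p]
    beta = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(3)]
    gap = (information_value(prior, eta, beta)
           - information_value(prior, eta_garbled, beta))
    worst_gap = min(worst_gap, gap)

print(eta_garbled, worst_gap)   # worst_gap should not be negative
```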


4.3 Maximal and minimal information matrices. Theorem:

(4.3.1)  $I_m > \eta > \mathbf{1}_m$,

where $\eta$ has $m$ rows and $I_m$ and $\mathbf{1}_m$ (identity matrix and sum vector of order $m$) correspond to perfect and to null-information (Sections 3.11, 3.9). Proof: Verify that

$\eta = I_m \eta$,  $\mathbf{1}_m = \eta \mathbf{1}_n$,

for any $\eta$ of order $m \times n$; then, noting that $\eta$ and $\mathbf{1}_n$ are Markov matrices, apply the Theorem of Sec. 4.2 on "garbling."

Thus the canonical forms of perfect information and of null-information constitute, respectively, the maximal and minimal elements of the lattice in which the essential set of information matrices is partially ordered by the relation "more informative than."

4.4 Comparative coarseness. Suppose the garbling matrix $G$ is noiseless, that is,

(4.4.1)  $g_{yy'} = 1$ or $0$

for all $y, y'$. That is, $G$ is reduced to a many-to-one mapping, $g$, from $Y = (1, \dots, n)$ to $Y' = (1, \dots, n')$, and clearly $n' \leq n$. Then it seems to agree with common usage to say that $Y'$ is coarser than $Y$ (or, equivalently, $Y$ is finer than $Y'$). For example, two elements $y_1$ and $y_2$ may be real numbers (or vectors), identical except for the last digit (or the last component), and this digit (or component) is omitted in the element $y' = g(y_1) = g(y_2)$ of $Y'$. "Some details are suppressed"; or more generally (to include the limiting case $G = I_n$, $n' = n$), "no details are added." Applying (4.4.1) to (4.2.1),

$\eta'_{zy'} = \sum_{y \in S_{y'}} \eta_{zy}$,  $S_{y'} = \{y : g(y) = y'\}$:

an intuitively obvious result. It follows from the Theorem of Section 4.2 that

(4.4.2)  if $\eta'$ is coarser than $\eta$ then $\eta > \eta'$.

This confirms the intuitive assertion that adding detail (at no cost!) cannot do damage, since the detail can be ignored.
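
Coarsening is the special case of garbling in which each observation is mapped to exactly one signal. A minimal sketch, with an assumed matrix `eta` and an assumed mapping `g` (neither is from the paper), sums the columns sharing an image and compares values:

```python
def information_value(prior, eta, beta):
    """V = sum_y max_a sum_z beta[a][z] * prior[z] * eta[z][y], as in (3.7.4)."""
    m, n = len(eta), len(eta[0])
    return sum(
        max(sum(b[z] * prior[z] * eta[z][y] for z in range(m)) for b in beta)
        for y in range(n)
    )


def coarsen(eta, g):
    """Sum the columns of eta that the many-to-one map g sends to the same signal.

    g[y] is the index of the signal y' = g(y); signals are numbered 0..max(g).
    """
    n_signals = max(g) + 1
    coarse = [[0.0] * n_signals for _ in eta]
    for z, row in enumerate(eta):
        for y, value in enumerate(row):
            coarse[z][g[y]] += value
    return coarse


prior = [0.5, 0.5]
beta = [[1.0, 0.0],
        [0.0, 1.0]]
eta = [[0.6, 0.3, 0.1],
       [0.1, 0.2, 0.7]]
g = [0, 1, 1]                    # observations 1 and 2 become the same signal

eta_coarse = coarsen(eta, g)
v_fine = information_value(prior, eta, beta)
v_coarse = information_value(prior, eta_coarse, beta)
print(eta_coarse, v_fine, v_coarse)
```

In this instance the suppressed detail mattered, because the two merged observations called for different actions; had they called for the same action, the two values would coincide.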


4.5 Blackwell's Theorem. We give this name to the proposition that

$\eta > \eta'$ if and only if $\eta' = \eta G$ for some Markov matrix $G$.

The sufficiency part was proved in Section 4.2. For proof of necessity, see Blackwell [1954] or Marschak and Miyasawa [1968].

4.6 The case of noiseless information.

Theorem: If $\eta$ and $\eta'$ are noiseless then $\eta > \eta'$ if and only if $\eta'$ is coarser than $\eta$.

Proof: Sufficiency follows from (4.4.2). Necessity follows from Blackwell's theorem, noting that if $\eta' = \eta G$ and $\eta$, $\eta'$ are noiseless then by (4.2.1) every entry in $G$ is either 1 or 0, i.e., $G$ is noiseless. (For a possibly more instructive, direct proof see Marschak and Radner [in press].)

4.7 Strong ordering by informativeness. It can be shown that for any two non-null information matrices $\eta$, $\eta'$,

(4.7.1)  $V_{\pi\beta}(\eta) = V_{\pi\beta}(\eta')$ for all $\pi$, $\beta$
         if and only if $\eta$ and $\eta'$ are identical up to a permutation of columns.

The sufficiency part of this proposition is obvious (see also Section 3.8). The necessity part can be restated using the ordering relation ">" and the equivalence relation $e$ of Section 3.10, thus:

(4.7.2)  If $\eta > \eta'$ and $\eta' > \eta$ then $\eta \; e \; \eta'$.

It would follow that (as stated at the end of Section 4.1) the partial ordering of the essential set of information matrices by the relation "more informative than" is a strong one.

Outline of proof. The hypothesis of (4.7.2) implies by Blackwell's theorem (Section 4.5) that there exist two Markov matrices $G$, $G'$ such that

(4.7.3)  $\eta' = \eta G$,  $\eta = \eta' G'$,

and hence

(4.7.4)  $\eta = \eta G G'$.

We can use two lemmas (proofs omitted). First, to show that $GG' = I$ unless $\eta$ is null, we use

Lemma 1: If $A$ and $B$ are two Markov matrices and $A = AB$ then $B$ is an identity matrix or $A$ consists of identical rows.

The proof of the theorem is completed by using

Lemma 2: If the product of two Markov matrices is the identity matrix, then they are permutation matrices.

Note: The theorem of this Section is obvious for the case of noiseless information matrices, in view of Section 4.6: for if $g$ maps $Y$ onto $Y'$, and $g'$ maps $Y'$ onto $Y$, then $g$ and $g'$ must be one-to-one mappings. For the general case, I would have liked, but have not succeeded, to provide a direct proof, not involving Blackwell's theorem and in a sense more instructive: to show that the equality in (4.7.1) cannot be maintained under some well-chosen variations of $\pi$, $\beta$, except when $\eta \; e \; \eta'$.

5. INFORMATIVENESS OF SYSTEMS OVER TIME

5.1 Environment, action, and observation as time-sequences. One or both of the arguments $a$, $z$ of the benefit function $\beta$ can be interpreted as time-sequences, as in (1.6.1), assuming additive costs as in Section 1.7. With $z$ a time-sequence, it will be convenient (changing our terminology somewhat) to call $z$ the environment and to reserve the term "successive events" to the components of the sequence $z = \{z_t\}$, $t = t_1, \dots, t_T$; to give unit-length to each of the intervals between successive $t_i$, $i = 1, \dots, T$; and sometimes to make $t_1 = 1$, so that $t = 1, \dots, T$. Each component $a_t$ of $a$ will be called successive action. If the benefit can be represented as a sum of discounted "successive benefits"

(5.1.1)  $\beta(a,z) = \sum_{t=1}^{T} d^t \beta^*(a_t, z_t)$,

say (as would be implied by the assumption (1.6.2) combined with (1.7.3)), then it is important to agree that $a_t$ and $z_t$ need not "physically" occur simultaneously: e.g., $a_t$ may be "sell stock short to-day" and $z_t$ may be "stock price a month from to-day."

A successive action $a_t$ is taken, using the decision rule $\alpha_t$, in response to $\bar{y}_t$ (note the bar!) where $\bar{y}_t$ is the remembered past history of successive observations,

(5.1.2)  $\bar{y}_t = (y_{t-\mu}, \dots, y_{t-1}, y_t)$;

the time-length $\mu$ measures the length of memory. Again, the subscript $t$ in $\bar{y}_t$ means only that the action taken at time $t$ is based on $\bar{y}_t$; it does not necessarily mean that $y_t$, the last component of $\bar{y}_t$, was "physically" observed at time $t$.

In this interpretation, $\pi$ becomes a distribution on the set $Z$ of sequences $z$. The information matrix $\eta$ transforms (stochastically, in general) the environment $z$ into a sequence of remembered histories,

(5.1.3)  $\bar{y} = (\bar{y}_{t_1}, \dots, \bar{y}_T) \in Y$;

that is, $\eta_{z\bar{y}}$ is the probability of the sequence $\bar{y}$ of remembered histories, given a particular environment (i.e., a particular sequence of successive events), $z = (z_1, \dots, z_T)$. A strategy $\alpha$ is a sequence of functions $\alpha_1, \dots, \alpha_T$, where $a_t = \alpha_t(\bar{y}_t)$; thus $\alpha$ is a function from $Y$ to the set $A$ of action-sequences. (As stated in Section 3.7, the $\alpha_t$, and thus $\alpha$, need not be stochastic.) With these generalizing interpretations, the results of Section 4 apply.

5.2 Effect of memory length on informativeness. Let $\mu' < \mu$; let inquiry $\eta'$ yield remembered history

(5.2.1)  $\bar{y}'_t = (y_{t-\mu'}, \dots, y_t)$

whenever inquiry $\eta$ yields remembered history

(5.2.2)  $\bar{y}_t = (y_{t-\mu}, \dots, y_{t-\mu'}, \dots, y_t)$;

clearly $\eta'$ is coarser than $\eta$. Hence by (4.4.2) $\eta$ is more informative than $\eta'$.

5.3  Delayed  vs.  prompt  perfect  information.  Prompt 
perfect  and  delayed  perfect  information  are  defined,  respec¬ 
tively  by 

yfc  =  >  t®5  1,... ,T 

y t  =  z  ,  t  =  0  +  1,...,T  ; 

0  is  the  delay,  an  integer  with  0  <  6  <  T  .  Now,  the  :e  is 

a  one-to-one  correspondence  between  the  set  Z  of  env^ron- 

the 

ments  z  (sequences  of  successive  events)  on/one  hand,  and„ 
on  the  other;  the  set,  Z  (say)  of  sequences  z  =  (z^  . . . ,  r-T) 
of  past  histories,  zfc  =  (z^,...,z^.),  of  successive  events: 
f°r  zt+1=(zt,zt+1) .  Replace  Z  by  Z  and  redefine  £ 
and  tt  accordingly.  Then  prompt  perfect  inquiry,  T\  ,  say, 
is  represented  by  the  identity  matrix  I;  but  delayed  perfect 
inquiry  is  not.  Hence  >  1)’  ,  by  (4.3.1)  .  A  delay  cannot 
improve  perfect  information.  But  if  prompt  information  is  not 
perfect,  its  value  can  be  exceeded  by  that  of  delayed  (perfect 
or  inperfect)  information.  Thus,  detailed  survey  data,  even 
when  2  years  old,  may  be  more  valuable  (because  less  "coarse": 
see  section  4.4)  than  those  of  a  less  detailed  survey  made 
at  the  time  the  action  is  taken. 
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5 . 4  Perfect  information  with  long  ”S.  short  delay 
when  the  environment  is  Markovian.  Given  the  distribu¬ 

tion  7r  on  tho  set  of  environments  (sequences  of  successive 
events)  we  can  derive  the  conditional  probability  of  the  event 
zfc  given  the  preceding  past  history,  *) 

Pj.  =  P(zt i  zt-l^  * 

and  also  the  conditional  probability  of  zt  given  z,._^  , 
pt  *  p(zJzt-l*  * 

The  environment  z  is  said  to  be  Markovian  if 

(5.4.1)  pt  =  pt  . 

Theorem .  £f  z  is  Markovian  then  a  perfect  inquiry 
with  shorter  delay  is  more  informative  than  a  perfect 
inquiry  with  longer  delay. 

Outline  of  Proof.  ’7e  omit  the  proof  of  the  following 

Lemma:  If  z  is  Markovian  and  t^<  t^<  t  ^ 

then  p(z  |z  ,z  )  =  p( z^_  jr  ). 

c3  Z2  1  “3  2 

Now  let  two  perfect  inquiries,  T]g  and  T)e  f ,  be  characterized, 
respectively,  by 

(5.4.2)  yt  =  zt_e  ,  yj  =  zt_,, 

where  6  <  9'  .  If  £  is  Markovian  then  by  the  Lemma, 

(5.4.3)  p(zt|yt,y|)  =  o(zt|yt), 

or  temporarily  omitting  the  subscript  t.  for  brevity. 

We  aie  the  same  functional  symbol  p  for  various  conditional 
and  joint  probabilities.  o ( • ] - ) ,  p(-.*)r  no  ambiguity  arises 
if  one  oays  attention  to  the  arguments  within  the  parentheses. 
END  OF  FOOTNOTE. 


$p(z \mid y, y') = p(z \mid y)$, that is

$p(z,y,y')/p(y,y') = p(z,y)/p(y)$; hence

$p(z,y,y')/p(z,y) = p(y,y')/p(y)$,

$p(z,y,y')/[p(y \mid z)\,p(z)] = p(y' \mid y)$,

$p(z,y,y')/p(z) = p(y \mid z)\,p(y' \mid y)$,

$p(y,y' \mid z) = p(y \mid z)\,p(y' \mid y)$;

summing over $y$ and restoring the subscript $t$,

$p(y'_t \mid z_t) = \sum_{y_t} p(y_t \mid z_t)\,p(y'_t \mid y_t)$.

Thus inquiry $\eta_{\theta'}$ can be obtained from $\eta_\theta$ by garbling, as
in (4.2.1). Hence, by the Theorem of Section 4.2, $\eta_\theta$ is more
informative than $\eta_{\theta'}$.

As in the case $\theta = 0$ discussed in Section 5.3 for
all (not necessarily Markovian) environments, the condition
$\theta < \theta'$ does not imply greater informativeness of $\eta_\theta$
compared with $\eta_{\theta'}$, if $\eta_\theta$, $\eta_{\theta'}$ are not perfect inquiries in
the sense of (5.4.2); for then, even if $z$ is Markovian,
(5.4.3) would not follow. So that, again, a shorter delay
can be profitably traded off against greater precision.
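
The garbling relation behind the Theorem can be checked numerically. The following is only a minimal sketch under stated assumptions: the environment is taken to be a stationary two-state Markov chain, and the transition matrix `P`, the delays, and all helper names are hypothetical, chosen for illustration and not taken from the text. For such a chain the likelihood matrix of a perfect inquiry with delay $\theta$ is the $\theta$-step time-reversed transition matrix, and the matrix for the longer delay should equal the matrix for the shorter delay post-multiplied by a Markov (garbling) matrix.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, k):
    """k-th power of a square matrix (k >= 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


# Hypothetical transition matrix: P[i][j] = Pr(z_{t+1} = j | z_t = i).
P = [[0.9, 0.1],
     [0.3, 0.7]]

# Stationary distribution of the two-state chain.
total = P[0][1] + P[1][0]
pi = [P[1][0] / total, P[0][1] / total]

# Time-reversed one-step matrix: R[i][j] = Pr(z_{t-1} = j | z_t = i).
R = [[pi[j] * P[j][i] / pi[i] for j in range(2)] for i in range(2)]


def likelihood(theta):
    """eta_theta[z][y] = Pr(y_t = y | z_t = z) when y_t = z_{t-theta}."""
    return matpow(R, theta)


theta, theta_prime = 1, 3          # shorter and longer delay (assumed values)
eta_short = likelihood(theta)
eta_long = likelihood(theta_prime)

# Garbling matrix G[y][y'] = Pr(y'_t = y' | y_t = y): the reversed chain
# run for the extra delay theta_prime - theta.
G = matpow(R, theta_prime - theta)
garbled = matmul(eta_short, G)

# Largest entrywise difference between eta_short * G and eta_long.
max_gap = max(abs(garbled[i][j] - eta_long[i][j])
              for i in range(2) for j in range(2))
print(max_gap)
```

The sketch only illustrates the Markovian case; for a periodic environment the reversed-chain construction used here would not apply.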

Furthermore, shorter delay is not necessarily advantageous
if the environment is not Markovian but is, for
example, periodic. Restaurant menus do not vary much as between
Sundays, and also, in Catholic countries, as between
Fridays. And both differ from each other and from the menus
of other days of the week. In a Catholic country, before
deciding on a Thursday where to eat next Sunday, it is best
to know next Sunday's menu ($\theta = 0$, as in Section 5.3); but
the next best is to learn the menu, not of next Friday
($\theta = 2$ days) but of the previous Sunday ($\theta = 7$ days)!

5.5  Obsolescence  and  impatience.  The  discount  constant 
d  ,  as  used  in  Sections  1.6,  1.7,  reflects  a  feature  of  the 
utility  function,  sometimes  called  impatience.  It  is  one  reason 
why  delays  diminish  the  value  of  an  inquiry  (and,  more  generally, 
of information systems: see end of Section 2.3). We see now
another reason, which, when it is applicable, may be more powerful:
the obsolescence of the inputs to the decision-making.*)

5.6 Sequential inquiries and adaptive programming.

The concept $a_t$ of a successive action (decision) can be usefully
extended to include decisions about the observations to
be taken at the next point of time. Thus

(5.6.1) $a_t = (a'_t, \eta_{t+1})$,

where $a'_t$ may be called successive action in the ordinary
sense (it enters the benefit function) and $\eta_{t+1}$ is "inquiry
at time $t+1$." Both are chosen simultaneously, on the basis


*) Further analysis, using some special classes of environment
distributions $\pi$ and benefit functions $\beta$, is given in Chapter
7 of Marschak and Radner [in press].




of remembered history, $y_t$. Sequential sampling in statistics
is a special case, with $a'_t$ including among its values the
null-action: do nothing that would directly influence the
benefit, and $\eta_{t+1}$ including among its values the null-inquiry;
$a'_t$ is null (i.e., ordinary action is postponed)
and $\eta_{t+1}$ is non-null (i.e. further observations are
taken), till some point $\tau$ (say) such that $\eta_{\tau+1}$ is null
(observations cease) and $a'_\tau$ is non-null ("terminal action"). The
more general case is "earn while you learn".

Inquiring and deciding over time, including the general,
sequential case just discussed, is sometimes called adaptive
programming. This is sometimes described as a sequence of
step-wise revisions of the probability distribution of the
environment, starting with the prior distribution $\pi$ and
replacing it with posterior distributions, given past histories,
$p(z \mid y_t)$, $t = 1, \ldots$ This description can lead to
misapplications, if the researcher estimates each of these successive
distributions by some conventional parameters (means, variances,
for example). The parameter actually needed is the optimal
action $a^*_t$ (say) itself! Also, a misleading distinction
is sometimes made between "stochastic programming" in which
the distribution of $z$ is known, and "adaptive programming"
in which it is gradually learned. But actually, once the
knowledge of the prior distribution $\pi$ is admitted the
mathematical processes needed to compute the optimal sequence of
actions (including inquiries as in (5.6.1)) are equivalent.*)

*) Bellman [1961], Marschak [1963], Miyasawa [1968].
6. OPTIMAL INQUIRIES

6.1 Binary information matrices as an example. The
information (or likelihood) matrix $\eta = [\eta_{zy}]$ is called binary
if it is of order 2x2, so that Z = (1,2), Y = (1,2) and we can write

(6.1.1) $\eta_{11} = 1 - \eta_{12} = p_1$, $\eta_{22} = 1 - \eta_{21} = p_2$.

To avoid triviality, we assume the probabilities $\pi_z$ of the
two events to be both positive:

(6.1.2) $0 < \pi_2 = 1 - \pi_1 < 1$.

Binary information matrices are widely used in statistics.
In testing against a null-hypothesis, the "error
probabilities of first and second kind" are defined as
$\eta_{11}$, $\eta_{22}$ or their complements. Binary "channel matrices" are
much used in the theory of communication. We shall look to
both fields for examples when, later in this section, we compute
the maximal difference between expected benefit and expected
cost, using sampling costs as well as the cost of a
channel.


6.2 Informativeness of binary inquiries. (For brevity,
we speak of "inquiries" instead of "information matrices" even
though we are, in fact, concerned with the stochastic
transformation $\eta$ characterizing the whole information processing
chain: see Section 2.3.) Given the matrix

(6.2.1) $\eta = [\eta_{zy}] = \begin{pmatrix} p_1 & 1-p_1 \\ 1-p_2 & p_2 \end{pmatrix}$, $0 < p_i < 1$, $i = 1,2$,

there is no loss of generality in permuting the columns so as
to make

(6.2.2) $p_1 + p_2 > 1$.

(The "error probabilities of two kinds", usually taken to be
small, would then be denoted by $1-p_1$, $1-p_2$.) Define the two
likelihood ratios

(6.2.3) $\lambda_1 = p_1/(1-p_2)$; $\lambda_2 = p_2/(1-p_1)$.

Then under the convention (6.2.2),

(6.2.4) $\lambda_1 > 1$, $\lambda_2 > 1$,

and, denoting by $p_y^{(i)}$, $\lambda_y^{(i)}$ ($y = 1,2$), respectively, the
likelihoods and likelihood ratios characterizing the matrix
$\eta^{(i)}$, we have the following

Theorem. (1) If $\lambda_y^{(1)} \ge \lambda_y^{(2)}$ ($y = 1,2$), then $\eta^{(1)} > \eta^{(2)}$,
and conversely.

(2) If $p_y^{(1)} \ge p_y^{(2)}$ ($y = 1,2$), then $\eta^{(1)} > \eta^{(2)}$,
but the converse is not true.

(1) and the first part of (2) follow from Blackwell's theorem
(Section 4.5 above). For the second part of (2) let

Then $p_1^{(1)} > p_1^{(2)}$ but $p_2^{(1)} < p_2^{(2)}$; yet $\lambda_1^{(1)} > \lambda_1^{(2)}$, $\lambda_2^{(1)} > \lambda_2^{(2)}$,
so that by (1) $\eta^{(1)} > \eta^{(2)}$.

p  •  ^  o 


then  condition  6.2.5  is  satisfied.  Thus,  regardless  of 
penalties for errors of first and second kind (i.e., regardless
of the benefit matrix $\beta$: see Section 6.4) it may pay
to decrease the error probability of only one kind while
increasing that of the other. (See Figure 1).
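
Part (1) of the Theorem can be illustrated with a small numerical sketch. The two inquiries below are hypothetical values chosen for illustration (they are not the example of the original text), and the function names are likewise assumptions. The sketch compares the likelihood ratios (6.2.3) and, independently, solves $\eta^{(1)} G = \eta^{(2)}$ for $G$ to see whether $G$ is a Markov (garbling) matrix.

```python
def likelihood_ratios(p1, p2):
    """Likelihood ratios (6.2.3) of the inquiry with eta_11 = p1, eta_22 = p2."""
    return p1 / (1.0 - p2), p2 / (1.0 - p1)


def dominates_by_ratios(first, second):
    """Condition of part (1): both likelihood ratios of `first` are at least as large."""
    lam_first = likelihood_ratios(*first)
    lam_second = likelihood_ratios(*second)
    return lam_first[0] >= lam_second[0] and lam_first[1] >= lam_second[1]


def garbling_matrix(first, second):
    """Solve eta1 * G = eta2 for G, with eta = [[p1, 1 - p1], [1 - p2, p2]]."""
    p1, p2 = first
    s1, s2 = second
    det = p1 + p2 - 1.0                      # positive under convention (6.2.2)
    inverse = [[p2 / det, -(1.0 - p1) / det],
               [-(1.0 - p2) / det, p1 / det]]
    eta2 = [[s1, 1.0 - s1], [1.0 - s2, s2]]
    return [[sum(inverse[i][k] * eta2[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]


# Hypothetical inquiries (p1, p2): eta_a has the larger p1 but the smaller p2.
eta_a = (0.9, 0.6)
eta_b = (0.7, 0.65)

G = garbling_matrix(eta_a, eta_b)
print(dominates_by_ratios(eta_a, eta_b), G)
```

With these assumed numbers the first inquiry has the smaller $p_2$ and yet both of its likelihood ratios are larger, and the solved $G$ has non-negative entries with unit row sums.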

It is clear that all null-information matrices (Section
3.9) satisfy

(6.2.5) $p_1 + p_2 = 1$; $\lambda_1 = \lambda_2 = 1$ (main diagonal in Fig. 1)

while perfect information is characterized by

(6.2.6) $p_1 = p_2 = 1$; $\lambda_1$, $\lambda_2$ infinite (point (1,1) in Fig. 1).

6.3 Symmetric binary information matrices. This is
a special case of (6.2.1), with

$p_1 = p_2 = p$,

say. The convention (6.2.2) becomes

$p > \tfrac12$,

and it follows from the theorem of the preceding section
that the information value is non-decreasing in $p$: an
intuitively obvious result. On Fig. 1, the symmetric matrices
are represented by the line (not drawn) connecting $(\tfrac12,\tfrac12)$ and
(1,1).
6.4 Benefit matrix and information value: the case
of two actions. As stated in Section 1.8, no two rows and
no two columns of the matrix $\beta = [\beta_{az}]$ are identical; and
any action represented by a dominated row is eliminated.
If after such elimination there remain two rows, i.e. A = (1,2),
there is no loss of generality in writing

(6.4.1) $\beta = [\beta_{az}] = \begin{pmatrix} b_1 & b_2 - r_2 \\ b_1 - r_1 & b_2 \end{pmatrix}$, $r_z > 0$, $z = 1,2$;

the $r_z$ are often called "regrets" (about not having used
the action $a = z$, optimal under certainty). This benefit
matrix is, in effect, used in statistics when the two actions
are: "reject the hypothesis" and "accept it;" the $r_z$ are
then penalties for committing an error of first or second
kind.


For brevity, write

(6.4.2) $q_i = 1 - p_i$, $i = 1,2$.

With both $\eta$ and $\beta$ of order 2x2 the value of information
is, by (3.7.2), (3.7.5),

$V(\eta) = V_1(\eta) + V_2(\eta)$, where

$V_1(\eta) = \max(\beta_{11}\pi_1 p_1 + \beta_{12}\pi_2 q_2,\ \beta_{21}\pi_1 p_1 + \beta_{22}\pi_2 q_2)$,

$V_2(\eta) = \max(\beta_{11}\pi_1 q_1 + \beta_{12}\pi_2 p_2,\ \beta_{21}\pi_1 q_1 + \beta_{22}\pi_2 p_2)$;

then by (4.3.1), (6.2.6), (6.2.5), (6.4.1), the value of perfect
and of null-information are, respectively

$V^{\max} = \pi_1 \beta_{11} + \pi_2 \beta_{22}$,

$V^{\min} = \max(\pi_1\beta_{11} + \pi_2\beta_{12},\ \pi_1\beta_{21} + \pi_2\beta_{22}) = V^{\max} - \min(\pi_1 r_1, \pi_2 r_2)$.
(Note: If we considered inquiry costs fixed, the comparison
between expected utilities (not: benefits) of inquiries
would not be affected by putting $b_1 = b_2 = V^{\max} = 0$. This is
usually done in statistics.)

The "weighted regrets" $\pi_z r_z$, $z = 1,2$, are by (6.1.2), (6.4.1)
always positive, and we can label the events $z$ so that,
without loss of generality,

$\pi_2 r_2 \ge \pi_1 r_1 > 0$.

Then

$V^{\min} = V^{\max} - \pi_1 r_1$,

$V(\eta) = V^{\max} - \min(\pi_1 r_1,\ \pi_1 r_1 q_1 + \pi_2 r_2 q_2)$,

by (6.2.2), (6.4.2).

Hence, remembering (6.4.2),

(6.4.3) $V(\eta) = V^{\min}$ if $\pi_1 r_1 p_1 + \pi_2 r_2 p_2 \le \pi_2 r_2$;
$V(\eta) = V^{\max} - L(\eta)$ otherwise;

where $L(\eta)$, the loss due to imperfection of information, is

(6.4.4) $L(\eta) = \pi_1 r_1 (1-p_1) + \pi_2 r_2 (1-p_2)$.
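
As a check on (6.4.3) and (6.4.4), the sketch below computes $V(\eta)$ directly as $V_1 + V_2$ and compares it with the closed form. The numerical values of `b`, `r` and `pi` are hypothetical, chosen only so that $\pi_2 r_2 \ge \pi_1 r_1 > 0$; the function names are assumptions made for illustration.

```python
def value_direct(beta, pi, p1, p2):
    """V(eta) = V_1 + V_2: best expected benefit after each observation y."""
    eta = [[p1, 1.0 - p1], [1.0 - p2, p2]]          # eta[z][y]
    total = 0.0
    for y in range(2):
        total += max(sum(beta[a][z] * pi[z] * eta[z][y] for z in range(2))
                     for a in range(2))
    return total


def value_closed_form(b, r, pi, p1, p2):
    """Formulas (6.4.3)-(6.4.4); assumes pi[1]*r[1] >= pi[0]*r[0] > 0 and p1 + p2 > 1."""
    v_max = pi[0] * b[0] + pi[1] * b[1]
    v_min = v_max - pi[0] * r[0]
    if pi[0] * r[0] * p1 + pi[1] * r[1] * p2 <= pi[1] * r[1]:
        return v_min                                # the "useless" set
    loss = pi[0] * r[0] * (1.0 - p1) + pi[1] * r[1] * (1.0 - p2)
    return v_max - loss


# Hypothetical data: b_1 = b_2 = 0 (as is usual in statistics), regrets 1 and 3.
b = (0.0, 0.0)
r = (1.0, 3.0)
pi = (0.5, 0.5)
beta = [[b[0], b[1] - r[1]],        # action 1: (b_1, b_2 - r_2), as in (6.4.1)
        [b[0] - r[0], b[1]]]        # action 2: (b_1 - r_1, b_2)

for p1, p2 in [(0.9, 0.6), (0.9, 0.9)]:
    print(p1, p2, value_direct(beta, pi, p1, p2),
          value_closed_form(b, r, pi, p1, p2))
```

With these assumed numbers the inquiry (0.9, 0.6) falls in the useless set, while (0.9, 0.9) does not.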
Thus all inquiries such that

$\pi_1 r_1 p_1 + \pi_2 r_2 p_2 \le \pi_2 r_2$

have the same value as null-information. They constitute a
"useless" indifference set. In the $(p_1,p_2)$-plane, all other
indifference sets are straight lines parallel to the line

(6.4.5) $\pi_1 r_1 p_1 + \pi_2 r_2 p_2 = \pi_2 r_2$,

which bounds the triangle representing the useless set. See
Figure 2. The information value as a function of $(p_1,p_2)$ over
the region (6.2.2) is, then, represented by a horizontal plane
and an upward sloping plane, intersecting along the line (6.4.5).

If $\eta$ is symmetrical, $p_1 = p_2 = p$, the above results
become:

(6.4.6) $V(\eta) = V^{\min}$ if $p \le \pi_2 r_2/(\pi_1 r_1 + \pi_2 r_2)$;
$V(\eta) = V^{\max} - (\pi_1 r_1 + \pi_2 r_2)(1-p)$ otherwise.

Thus, the information value of symmetric binary information
(in the case of two actions), if plotted against the
probability $p$ ($\ge \tfrac12$), consists of a "useless" horizontal segment
till $p$ reaches a certain bound; and is a positively sloped
straight line for larger $p$. See Figure 3a.
6.5 The case of more than two actions. If the number of actions
exceeds two, the value of a binary inquiry need not be (piece-wise) linear
in the probabilities $p_1, p_2$; and the indifference curves, including the
one bounding the "useless" region, need not be linear.*/ In fact, the
indifference curves can become strictly concave (quite unlike those of
consumer theory). This can be shown by inserting appropriate numerical
*/ The non-linearity of indifference curves in the considered case of
more than 2 actions contradicts a statement of L.J. Savage [1962] who
obtained parallel straight lines of equal information values in the plane
$(p_1,p_2)$, presumably for any number of actions. Consider the following two
objects: 1) an inquiry $\eta = (p_1,p_2)$ (i.e., a binary inquiry with $\eta_{11} = p_1$,
$\eta_{22} = p_2$), and 2) a gamble, which we shall denote by $g = (\eta', \eta''; \alpha)$,
and which gives you access to inquiries $\eta' = (p'_1,p'_2)$ and $\eta'' = (p''_1,p''_2)$,
with odds $\alpha : (1-\alpha)$. Savage considers, in effect, $\eta$ and $g$ as
identical objects, provided

(*) $p_1 = \alpha p'_1 + (1-\alpha) p''_1$, $p_2 = \alpha p'_2 + (1-\alpha) p''_2$.

It is true that, if $\eta'$ and $\eta''$ have equal values, then $g$ has the
same value. For a decider in possession of $g$ will respond to
observation $y$ by the actions $a'_y$ and $a''_y$ with respective probabilities $\alpha$
and $1-\alpha$, where $a'_y$ and $a''_y$ denote the actions appropriate to $y$ when
the inquiry is $\eta'$ or $\eta''$, respectively. Hence, for the possessor of
$g$, observation $y$ has value [in the sense of (3.7.2)]

$V_y(g) = \alpha V_y(\eta') + (1-\alpha) V_y(\eta'')$, $y = 1,2$.

Therefore $V(g) = V_1(g) + V_2(g) = \alpha[V_1(\eta') + V_2(\eta')] + (1-\alpha)[V_1(\eta'') + V_2(\eta'')]$
$= \alpha V(\eta') + (1-\alpha) V(\eta'')$; so that, if $V(\eta') = V(\eta'') = V$, say,
then indeed $V(g) = V$.

If, in addition, condition (*) would imply that $g$ and $\eta$ are
identical objects, then it would indeed imply

$V(\eta) = V(g) = V = V(\eta') = V(\eta'')$,

so that the points representing $\eta$, $\eta'$ and $\eta''$ would lie on the same
indifference line. This line would be straight since by (*), $(p_1,p_2)$
lies on a straight line between $(p'_1,p'_2)$ and $(p''_1,p''_2)$. And all such
lines are parallel since you would, by the same reasoning, be indifferent
between gambles such as $(\eta', \eta''; \alpha)$ and $(\eta'', \eta'; \alpha)$.

However, I don't think that condition (*) makes the objects $\eta$ and
$g$ identical, or their values equal. The possessor of $\eta$ will respond
to observation $y$ by some appropriate action $a_y$ (say) which has, in
general, no relation to the actions $a'_y$ and $a''_y$ which are appropriate
when the inquiry is $\eta'$ or $\eta''$, respectively. There is therefore no
necessary equality between $V(\eta)$ and $V(g)$, and hence none between
$V(\eta)$ and $V(\eta')$ and $V(\eta'')$.

END  OF  FOOTNOTE 

values into the 3x2 benefit matrix (3 treatments, each with cancer
present or absent), mentioned in Section 3.^. To permit the use of
calculus consider, instead, a case in which actions constitute a closed
non-countable set, $0 \le a \le 1$, and the benefit function $\beta(a,z)$, now
written $\beta_z(a)$ for convenience, is twice differentiable with respect to
$a$. As before, Z = (1,2). Let $\beta_1(a)$ increase, and $\beta_2(a)$ decrease,
in $a$. Then $a > a'$ implies $\beta_1(a) > \beta_1(a')$ and $\beta_2(a) < \beta_2(a')$,
hence no action is dominated.

The following example will impose some further constraints. A
farmer wishes to maximize the amount harvested. He must decide on how
to allocate his total acreage (=1) between two crops, the "wet" crop
being favored by wet weather (denoted by $z = 1$), and the "dry" crop
by dry weather ($z = 2$). The action $a$ is the acreage allotted to the
wet crop. Let the harvest from $c$ acres of a given crop when the weather
is or is not favorable to it, be, respectively, $f(c)$ and $g(c)$. The
combined harvest of the wet and the dry crop is then

$\beta_1(a) = f(a) + g(1-a)$ in wet weather
$\beta_2(a) = f(1-a) + g(a)$ in dry weather.

It is natural to assume that $\beta_1$ increases, and $\beta_2$ decreases, in $a$,
so that no action is dominated, and

$\beta'_1(a) = f'(a) - g'(1-a) > 0$; $\beta'_2(a) = -f'(1-a) + g'(a) < 0$.

Finally, if both crops obey the "law of decreasing marginal returns,"
$f''(c) < 0$, $g''(c) < 0$, then

$\beta''_1(a) = f''(a) + g''(1-a) < 0$; $\beta''_2(a) = f''(1-a) + g''(a) < 0$;

write

(6.5.1) $\gamma_z(a) \equiv \pi_z \beta_z(a)$, $z = 1,2$;

then, since $\pi_z > 0$,

$\gamma'_1(a) > 0$; $\gamma'_2(a) < 0$;

(6.5.2) $\gamma''_z(a) < 0$, $z = 1,2$.

This is, in fact, the only additional constraint we need, to show that
the information value $V(\eta)$ is strictly convex (and therefore strictly
quasi-convex); that is, as $p_1, p_2$ vary, the second differential $d^2V > 0$
(and therefore the indifference lines are concave).*/ A sufficient condition

*/ A twice-differentiable and increasing function $F(x)$ of a vector $x$
is called strictly concave over its sub-domain X if, in that sub-domain,
$d^2F < 0$; clearly such a function is also strictly quasi-concave over X,
i.e., $d^2F < 0$ holds, in particular, for all $x$ in X such that $dF = 0$.
The contour lines of equal values of a quasi-concave $F$ are convex: see,
e.g., Arrow and Enthoven [1961]. Now, replacing "<" by ">", the
definitions of $F$ strictly convex and strictly quasi-convex (with contour lines
concave) are obtained.




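for this is:

(6.5.3) $w_{11} > 0$, $w_{22} > 0$, $\begin{vmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{vmatrix} > 0$,

where (6.5.3') $w_{ij} \equiv \partial^2 V/\partial p_i \partial p_j$.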

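To evaluate the $w_{ij}$, note that the expected benefit of action $a$ when
the observation is $y$ is

$B_y(a) \equiv \sum_{z=1}^{2} \gamma_z(a)\,\eta_{zy}$, $y = 1,2$;

the value of the observation $y$ is

$V_y = \max_a B_y(a) = B_y(a_y) = \gamma_1(a_y)\eta_{1y} + \gamma_2(a_y)\eta_{2y}$, $y = 1,2$,

with

(6.5.4) $V'_y \equiv \left[\partial B_y(a)/\partial a\right]_{a=a_y} = \gamma'_1(a_y)\eta_{1y} + \gamma'_2(a_y)\eta_{2y} = 0$,

since by (6.5.2) and with all $\eta_{zy}$ positive,

(6.5.5) $V''_y \equiv \left[\partial^2 B_y(a)/\partial a^2\right]_{a=a_y} = \gamma''_1(a_y)\eta_{1y} + \gamma''_2(a_y)\eta_{2y} < 0$.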


In terms of $p_1, p_2$, emphasizing that $V_y$ depends on $a_y$,

(6.5.6) $V_1 = B_1(a_1) = \gamma_1(a_1)p_1 + \gamma_2(a_1)(1-p_2)$;
$V_2 = B_2(a_2) = \gamma_1(a_2)(1-p_1) + \gamma_2(a_2)p_2$;

(6.5.7) $V'_1 = B'_1(a_1) = \gamma'_1(a_1)p_1 + \gamma'_2(a_1)(1-p_2) = 0$;
$V'_2 = B'_2(a_2) = \gamma'_1(a_2)(1-p_1) + \gamma'_2(a_2)p_2 = 0$.
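Write $\partial a_i/\partial p_j = a_{ij}$; then differentiating (6.5.6) with respect to $p_1$,
using (6.5.7),

$\partial V_1/\partial p_1 = V'_1 a_{11} + \gamma_1(a_1) = 0 + \gamma_1(a_1)$,

$\partial V_2/\partial p_1 = V'_2 a_{21} - \gamma_1(a_2) = 0 - \gamma_1(a_2)$;

and since $V = V_1 + V_2$,

$\partial V/\partial p_1 = \gamma_1(a_1) - \gamma_1(a_2)$; $\partial V/\partial p_2 = -\gamma_2(a_1) + \gamma_2(a_2)$;

then by (6.5.3') and writing $\gamma'_i(a_j) = \gamma_{ij}$,

(6.5.8) $w_{11} = \gamma_{11}a_{11} - \gamma_{12}a_{21}$, $w_{12} = \gamma_{11}a_{12} - \gamma_{12}a_{22}$,
$w_{21} = -\gamma_{21}a_{11} + \gamma_{22}a_{21}$, $w_{22} = -\gamma_{21}a_{12} + \gamma_{22}a_{22}$;

(it will be confirmed presently that $w_{12} = w_{21}$). To evaluate the $a_{ij}$,
differentiate with respect to $p_j$ the equations (6.5.7), which are
identities in the $a_i$:

$\partial V'_1/\partial p_1 = 0 = V''_1 a_{11} + \gamma_{11}$; $\partial V'_1/\partial p_2 = 0 = V''_1 a_{12} - \gamma_{21}$;

$\partial V'_2/\partial p_1 = 0 = V''_2 a_{21} - \gamma_{12}$; $\partial V'_2/\partial p_2 = 0 = V''_2 a_{22} + \gamma_{22}$;

solve for the $a_{ij}$, writing for brevity

(6.5.9) $V''_j \equiv 1/k_j < 0$, $j = 1,2$ [by (6.5.5)];

then

$a_{11} = -k_1\gamma_{11}$, $a_{12} = k_1\gamma_{21}$, $a_{21} = k_2\gamma_{12}$, $a_{22} = -k_2\gamma_{22}$,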
then 
and by (6.5.8), (6.5.9),

$w_{11} = -\gamma_{11}^2 k_1 - \gamma_{12}^2 k_2 > 0$,

$w_{22} = -\gamma_{21}^2 k_1 - \gamma_{22}^2 k_2 > 0$,

$w_{12} = \gamma_{11}\gamma_{21}k_1 + \gamma_{12}\gamma_{22}k_2 = w_{21}$,

$w_{11}w_{22} - w_{12}^2 = k_1 k_2 (\gamma_{11}\gamma_{22} - \gamma_{12}\gamma_{21})^2 > 0$,

thus establishing condition (6.5.3), sufficient for $V = V_1 + V_2$ to be
strictly convex in $p_1, p_2$.

$V$ must not be less than the value of null-information which is

$V^{\min} = \max_a\,[\gamma_1(a) + \gamma_2(a)] = \gamma_1(a^*) + \gamma_2(a^*)$,

where

(6.5.10) $\gamma'_1(a^*) + \gamma'_2(a^*) = 0$,
$\gamma''_1(a^*) + \gamma''_2(a^*) < 0$ by (6.5.2).

If an inquiry is useless, then by (6.5.7), (6.5.10),
$p_1 + p_2 = 1$. Thus the set of useless inquiries coincides with that of
null-inquiries, represented by the main diagonal of the unit-square.
(See conjecture in Section 3.15.)

For a simple example, let the prior weather probabilities be $\pi_1 =
\pi_2 = \tfrac12$ and let the production functions $g, f$ (for the crop not favored
or favored, respectively, by weather) be

(6.5.11) $g(c) = (3c - c^2)/8$; $f(c) = 3g(c)$; $c = a$ or $1-a$.

Then

$\beta_1(a) = 2\gamma_1(a) = -\tfrac12 a^2 + a + \tfrac14$; $\beta_2(a) = 2\gamma_2(a) = -\tfrac12 a^2 + \tfrac34$.
Then 
Apply this to (6.5.6), and solve (6.5.7) for the $a_y$, $y = 1,2$ (optimal
acreages of wet crop), writing $q_i = 1-p_i$. Then

$a_1 = p_1/(p_1 + q_2)$; $a_2 = q_1/(q_1 + p_2)$;

(6.5.12) $V_1 + V_2 = p_1^2/4(p_1 + q_2) + q_1^2/4(q_1 + p_2) + \tfrac12$.

It is easily seen that

$V_1 + V_2 = V^{\min} = 5/8$

if and only if $p_1 + p_2 = 1$. Thus all useless inquiries are null-inquiries,
represented by the diagonal $p_1 + p_2 = 1$. All indifference lines in the
space above the diagonal are strictly concave since in (6.5.12) $V_1 + V_2$
is strictly convex in $p_1, p_2$ for $p_1 + p_2 > 1$.

In the symmetric case, i.e., with $p_1 = p_2 = p = 1 - q \ge \tfrac12$, the
example yields

$a_1 = p$, $a_2 = q$,

$V = V_1 + V_2 = (p^2 + q^2)/4 + \tfrac12$,

so that, plotted against $p$, the information value $V$ is represented
by a strictly convex, rising curve (a parabola with a minimum at $p = \tfrac12$).
See Figure 3b.
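
The farmer's example lends itself to a numerical check. The sketch below, in which the function names and the grid resolution are assumptions made for illustration, maximizes the expected harvest over acreages by brute force and compares the result with the closed form (6.5.12).

```python
def g(c):
    """Harvest from c acres when the weather is unfavorable, as in (6.5.11)."""
    return (3.0 * c - c * c) / 8.0


def f(c):
    """Harvest from c acres when the weather is favorable, as in (6.5.11)."""
    return 3.0 * g(c)


def gamma(z, a, prior=0.5):
    """gamma_z(a) = pi_z * beta_z(a); z = 1 is wet weather, z = 2 is dry."""
    harvest = f(a) + g(1.0 - a) if z == 1 else f(1.0 - a) + g(a)
    return prior * harvest


def value_numeric(p1, p2, steps=2000):
    """V_1 + V_2 by searching acreages a on a grid of [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    v1 = max(gamma(1, a) * p1 + gamma(2, a) * (1.0 - p2) for a in grid)
    v2 = max(gamma(1, a) * (1.0 - p1) + gamma(2, a) * p2 for a in grid)
    return v1 + v2


def value_closed_form(p1, p2):
    """Formula (6.5.12)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return p1 ** 2 / (4.0 * (p1 + q2)) + q1 ** 2 / (4.0 * (q1 + p2)) + 0.5


for p1, p2 in [(0.9, 0.8), (0.6, 0.7), (0.8, 0.8)]:
    print(p1, p2, value_numeric(p1, p2), value_closed_form(p1, p2))
```

On the diagonal $p_1 + p_2 = 1$ the closed form reduces to 5/8, and in the symmetric case to $(p^2 + q^2)/4 + \tfrac12$.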
6.6. Cost conditions. So far, we have explored, at
least for the case of binary information matrices, the behavior
of the information value function $V(\eta)$ which associates
each $\eta$ with the maximum expected benefit. If utility
can be represented as the difference between benefit and
information cost, an optimal matrix $\eta$ maximizes the difference
between $V(\eta)$ and the expected information cost, subject
to a constraint on feasible pairs $(\eta,\gamma)$ of inquiries and
cost functions (Section 3.6),

$(\eta,\gamma) \in \{(\eta,\gamma)\}$.

A simple assumption is to associate each $\eta$ with just one
cost function $\gamma(\eta)$, viz., the one giving the lowest expected
cost [as in (3.'».6)], for a given $\eta$. In addition we shall
make $\gamma(\eta)$ independent of $z$:

$\gamma_z(\eta) = \gamma(\eta)$,

say. Thus, if $\eta$ is obtained by a sampling survey of families,
the cost $\gamma(\eta)$ will depend on the size of the sample needed
to obtain $\eta$ (i.e., to attain some preassigned error
probabilities); but not on the properties of the families,
disregarding, for example, the fact that households of certain
types may require second visits.
Using these simplifying assumptions, and still confining
ourselves to binary information matrices, we shall give two
examples illustrating the possible behavior of the cost
functions $\gamma(\eta)$. An important question is: under what conditions
does the expected utility, as a function of $\eta$,

(6.6.1) $U(\eta) = V(\eta) - \gamma(\eta)$

behave in such a way that the optimal information matrix is
an "interior solution". If it does not, the optimal binary
information matrix may be the perfect information,
$(p_1,p_2) = (1,1)$: a case of "large scale economies," making
the competitive market equilibrium non-optimal from the point
of view of social welfare. Thus when $V(\eta)$ is quasi-convex,
i.e. the indifference curves are concave, the existence of an
interior solution requires that the lines of equal cost be
also concave, with even larger curvature. This requirement
would mean, in the case of binary symmetric matrices, with
$p$ ($\ge \tfrac12$) replacing $\eta$ in (6.6.1) in an obvious manner, that

(6.6.2) $V'(p) = \gamma'(p)$ should imply $V''(p) < \gamma''(p)$.

6.7 Cost linear in channel capacity. The capacity
$C = C(p_1,p_2)$ of a channel transmitting one bit per time unit
(see below, Section 7.4) is given*) by

$C = \log_2\left[2^{(p_2H_1 - q_1H_2)/(q_1 - p_2)} + 2^{(p_1H_2 - q_2H_1)/(q_1 - p_2)}\right]$

where $H_i = -(p_i \log_2 p_i + q_i \log_2 q_i)$, $i = 1,2$.

*) See, e.g., Ash [1965], Theorem 3.3.3 (p. 56) and problem 3.7
(p. 304). $C$ is the conventional symbol for channel capacity.
In Section 3.2, $C$ was introduced to denote expected information
cost, which we shall here assume linear and increasing
in channel capacity. Apologies to the reader of this mimeographed
paper for this inconsistency in notations. In this
section, cost and expected cost are both $= \gamma(p)$.
$C$ is quasi-convex in $(p_1,p_2)$: contour lines of equal
capacity are strictly concave for $p_1 + p_2 > 1$; all points
on the straight line $p_1 + p_2 = 1$ have equal capacity $C = 0$;
and maximum capacity is $C(1,1) = 1$. See Fig. 4.

Suppose that the indifference lines (contour lines of
equal information value) are strictly concave, as in the
"farmer's case" of Section 6.5. Suppose further that the
"observations" $y = 1,2$ are messages ("wet", "dry") received
through a channel whose inputs are the "true" events (viz.
actual future weather), $z = 1,2$. And suppose information
cost increases linearly with channel capacity. An optimal information
system (consisting in this case of the channel and nothing
else) is an "interior" optimum if the optimal pair $(p_1,p_2)$
is such that

not: $p_1 + p_2 = 1$, or $p_1 = 1$, or $p_2 = 1$.

This requires that the contour lines of equal capacity be "more
concave" (i.e. have greater curvature) than the indifference
lines.

In the symmetric case, $p_1 = p_2 = p = 1 - q \ge \tfrac12$, the channel
capacity is

$C = 1 + p \log_2 p + q \log_2 q$.

If the channel cost $\gamma(p)$ (measured in our farmer's harvest
bushels or dollars) is increasing linearly in $C$ then the
expected utility is

$U = V - rC - \text{const.}$ ($r > 0$)

$U = -\tfrac12 p(1-p) - r[p \log_2 p + (1-p)\log_2(1-p)] - s$,

say. It will depend on the constants $r, s$, whether condition
(6.6.2) is fulfilled and thus an interior solution exists.
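
How the optimum depends on $r$ can be explored with a short sketch. It assumes the symmetric farmer's value $V = (p^2 + q^2)/4 + \tfrac12$ of Section 6.5 and a cost $rC$ with the additive constant dropped (the constant does not affect the maximizer); the function names, the grid search and the two trial values of $r$ are illustrative assumptions. For this particular $V$ one can verify that $U''(\tfrac12) = 1 - 4r/\ln 2$, so that an interior optimum should appear only for $r$ below $(\ln 2)/4$, roughly 0.17.

```python
import math


def capacity(p):
    """C = 1 + p log2 p + q log2 q of the symmetric binary channel (0 log 0 = 0)."""
    total = 1.0
    for x in (p, 1.0 - p):
        if x > 0.0:
            total += x * math.log2(x)
    return total


def info_value(p):
    """Information value (p^2 + q^2)/4 + 1/2 of the symmetric farmer's example."""
    q = 1.0 - p
    return (p * p + q * q) / 4.0 + 0.5


def utility(p, r):
    """U = V - r*C, with the additive constant omitted."""
    return info_value(p) - r * capacity(p)


def best_p(r, steps=5000):
    """Grid search for the p in [1/2, 1] that maximizes U."""
    grid = [0.5 + 0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: utility(p, r))


for r in (0.1, 0.3):          # hypothetical cost rates
    print(r, best_p(r))
```

Under these assumptions the cheaper channel ($r = 0.1$) gives an optimum strictly between null and perfect information, while for $r = 0.3$ the search returns $p = \tfrac12$, i.e. no channel is worth buying.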

6.8 Cost of inferring sign of mean of finite population
from sign of mean of sample. Suppose $n$ random variables
$u_i$ ($i = 1, \ldots, n$) are jointly normal, with

$E(u_i) = 0$, $E(u_i u_j) = 1$ if $j = i$, $= 0$ if $j \ne i$.**

Define the events $z$ and the "observations" (usually called
"statistics") $y$ by

$z = 1$ or $2$ according as $\sum_{i=1}^{n} u_i \ge 0$ or $< 0$;
$y = 1$ or $2$ according as $\sum_{i=1}^{m} u_i \ge 0$ or $< 0$,

$1 \le m \le n$;

thus $m$ is the size of the sample, and $n$ is the size of
the population. Then (see Cramér [1946], p. 290) the joint
distribution of $z$ and $y$ is given by

$\Pr(z=1, y=1) = \Pr(z=2, y=2) = 1/4 + (\arcsin \rho)/2\pi$,
$\Pr(z=1, y=2) = \Pr(z=2, y=1) = 1/4 - (\arcsin \rho)/2\pi$,

where $\rho = \sqrt{m/n}$. Hence $\eta$ is binary symmetric, with

$\eta_{11} = \Pr(y=1 \mid z=1) = \tfrac12 + (\arcsin \sqrt{m/n})/\pi = \eta_{22} = p \ge \tfrac12$.

**See Marschak [1964], equation (55). The example given in
that paper has $n$ stocks in a portfolio and a sample of $m$
of them. It seems more difficult to think of a finite population
(finite number of possible future states of humidity)
in the case of our "farmer." With population infinite the
optimal sample size is convex in $\eta_{11} = p$ (instead of
exhibiting an inflexion), so that the example would not add
much to that of the preceding Section. END OF FOOTNOTE


$m = n \sin^2 \pi(p - \tfrac12)$,
$dm/dp = \pi n \sin \pi(2p-1)$,
$d^2m/dp^2 = 2\pi^2 n \cos \pi(2p-1) \gtrless 0$ if $p \lessgtr 3/4$.

The sample size $m$ is thus an increasing function of $p$,
convex for small (and hence less informative: Section 6.3)
values of $p$, and concave for larger ones. So is the cost
of information (sampling cost) if we assume it to increase
linearly with $m$. Therefore a whole range of sufficiently
small values of $p$ (and therefore of $m$) is non-optimal,
especially if information value $V(p)$ is strictly convex in $p$.
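
The relation between sample size and accuracy, and the inflexion at $p = 3/4$, can be checked with the following sketch. The function names, the population size and the step of the finite difference are assumptions chosen for illustration.

```python
import math


def accuracy(m, n):
    """p = eta_11 = 1/2 + arcsin(sqrt(m/n)) / pi for a sample of m out of n."""
    return 0.5 + math.asin(math.sqrt(m / n)) / math.pi


def sample_size(p, n):
    """Inverse relation m = n * sin^2(pi * (p - 1/2))."""
    return n * math.sin(math.pi * (p - 0.5)) ** 2


def second_difference(p, n, h=1e-4):
    """Finite-difference estimate of d^2 m / dp^2."""
    return (sample_size(p + h, n) - 2.0 * sample_size(p, n)
            + sample_size(p - h, n)) / (h * h)


n = 100                                   # hypothetical population size
for p in (0.6, 0.75, 0.9):
    print(p, sample_size(p, n), second_difference(p, n))
```

The sign of the estimated second derivative changes from positive to negative as $p$ passes 3/4, which is the convex-then-concave shape of the sampling cost described above.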
7. ECONOMICS OF COMMUNICATION

7.1 The fidelity criterion as benefit. In the preceding
three Sections, the benefit $\beta(a,z)$ depends on the
"action" $a$ in A and the "event" (or hypothesis*) $z$ in
Z. A probability function $\pi$ is defined on Z. Event $z$
is transformed into "observation" $y$ by a processing $\eta$;
and $y$ is transformed into $a$ by a subsequent processing $\alpha$
called strategy (these processings are possibly stochastic).

Now let us interpret, instead, $z$ in Z as a "message
sent", occurring with probability $\pi_z$. Interpret processing
$\eta$ as "communication" (to be specified later as a chain:
storing, encoding, transmitting): it transforms message $z$
into $y$, the latter to be interpreted as some signals received
by the decision-maker. An important restriction is
this: the set A of actions $a$ is identical with the set
Z of messages sent. The strategy $\alpha$ consists then in a rule
of "decoding" the received signals $y$, i.e., in prescribing
which element $a$ of Z (or which conditional distribution
of $a$) should be associated with a given $y$.

The early writings on communication theory, most importantly
the pioneering work of Shannon [1948], imposed a
further restriction, by assuming equal penalty for all communication
errors, so that "a miss is as bad as a mile." That
is, the benefit function is taken to be simply

(7.1.1) $\beta(a,z) = 0$ if $a = z$, $= -1$ if $a \ne z$,

*See  second  paragraph  of  Section  3.1 
i.e., $-\beta(a,z)$ is 1 minus the Kronecker delta. Then the expected
benefit is, by (3.6.3) and (7.1.1)

$B_\pi(\eta,\alpha) = \sum_z \sum_y \sum_a \beta(a,z)\,\pi_z \eta_{zy} \alpha_{ya} = -\sum_{z \ne a} \sum_y \pi_z \eta_{zy} \alpha_{ya} = -p_e$,

where $p_e$ denotes the "probability of error". For a given
set Z (characterized by $\pi$), $p_e$ depends on the properties
of the communication processing $\eta$ and the decoding
strategy $\alpha$.
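
The dependence of $p_e$ on $\eta$ and $\alpha$ can be made concrete with a small sketch. The prior, the channel matrix and the decoding rule below are hypothetical values, and the function names are assumptions made for illustration; the sketch evaluates the triple sum of the expected benefit under (7.1.1) and the probability of error separately.

```python
def error_probability(pi, eta, alpha):
    """p_e: sum over z != a and over y of pi[z] * eta[z][y] * alpha[y][a]."""
    n_z, n_y = len(eta), len(eta[0])
    return sum(pi[z] * eta[z][y] * alpha[y][a]
               for z in range(n_z) for y in range(n_y) for a in range(n_z)
               if a != z)


def expected_benefit(pi, eta, alpha):
    """Expected benefit when beta(a, z) = 0 if a == z and -1 otherwise."""
    n_z, n_y = len(eta), len(eta[0])
    return sum((0.0 if a == z else -1.0) * pi[z] * eta[z][y] * alpha[y][a]
               for z in range(n_z) for y in range(n_y) for a in range(n_z))


# Hypothetical data: equally likely messages, symmetric channel, and the
# deterministic decoding rule "decode signal y as message y".
pi = [0.5, 0.5]
eta = [[0.9, 0.1],
       [0.1, 0.9]]
alpha = [[1.0, 0.0],
         [0.0, 1.0]]

print(error_probability(pi, eta, alpha), expected_benefit(pi, eta, alpha))
```

A stochastic decoding rule can be tried by replacing the rows of `alpha` with any probability vectors over the messages.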

However, the special restriction (7.1.1) was abandoned
later, when Shannon [1960] introduced the "fidelity criterion"
(and its negative, the "distortion"), a general real-valued
function of the message sent and the message decoded. This
function is identical with our general benefit function that
maps Z x A into reals; except for the restriction (mentioned
above) that replaces Z x A by Z x Z. A fidelity criterion
does, then, assign different penalties (negative benefits)
to different errors of communication and decoding. This idea
has not yet penetrated the bulk of literature, certainly not
the textbooks, on communication theory.*)

*) But see, more recently, Jelinek [1968] and Pham-Huu-Tri
[1968]. The coding procedures recommended by
Shannon to maximize expected fidelity can be made more efficient
in several respects. END OF FOOTNOTE


7.2 Capacity of noiseless channel. We mentioned in
Section 3.2 that statistical decision theory neglects delays
in processing. Communication theory does not neglect them.
Concepts like the speed of a processing (thruput per time unit),
and the maximum of this speed, achievable with a given
processing instrument and called its capacity, arise naturally.

As a simple case, imagine a noiseless transmission channel.
Its inputs are sequences of symbols such as dots and dashes, or
numerical digits. Let us call them digits. They are the
outputs of the preceding processing link, the encoding, to be
discussed in the next Section, 7.3. The digits are transmitted
through the channel one by one and received at the
other end with no distortion. If the channel is a cable
consisting of several wires, several symbols can be transmitted
simultaneously. We can therefore diminish delays by increasing
the number of wires, which thus measures the channel's
capacity: the maximum number of digits that can be transmitted
per unit of time.

Channel capacity, already in the noiseless case, is
economically significant for two reasons. First, if the inflow
of input digits per time unit exceeds the channel capacity,
untransmitted, and therefore useless, inputs will pile up
indefinitely, with an obvious detriment to the expected benefit.
Second, any further increase of capacity, in excess of the
inflow of inputs, will diminish the delay between input and
output of the channel.

That delays can diminish expected benefit is due to
"impatience" (preference for early results of actions) as
well as to the obsolescence of data, i.e., in our case, of
the channel outputs, on which the choice of action is based.
This was discussed in Sections 1. and 5.
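
The two reasons can be illustrated with a toy model. It is a sketch
under simplifying assumptions of my own: time is discrete, inflow and
capacity are constant whole numbers of digits per time unit, and the
function names are hypothetical.

```python
def backlog_over_time(inflow, capacity, periods):
    """Digits still untransmitted at the end of each time unit."""
    waiting = 0
    history = []
    for _ in range(periods):
        waiting = max(0, waiting + inflow - capacity)
        history.append(waiting)
    return history


def periods_to_transmit(n_digits, capacity):
    """Time units needed to pass a batch of n_digits through the channel."""
    return -(-n_digits // capacity)  # ceiling division


# First reason: inflow above capacity makes the backlog grow without bound.
overloaded = backlog_over_time(inflow=5, capacity=4, periods=6)
adequate = backlog_over_time(inflow=5, capacity=5, periods=6)
print(overloaded, adequate)

# Second reason: more capacity shortens the delay for a given batch.
print(periods_to_transmit(12, 4), periods_to_transmit(12, 6))
```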

While increased channel capacity thus increases expected
benefit, it will, in general, also require an increase in
cost.

Expected benefit is diminished by delay. But benefit is
not necessarily a linear function of delay. Hence (see
Appendix I) expected utility (difference between expected
benefit and expected cost) is not monotone in expected delay.
Therefore, it is not correct to present the economics of
communication, even in the simplest case of a noiseless channel,
as that of minimizing expected cost for a given expected
delay, or expected speed of transmission. Yet, just this seems
to be done, in this or similar contexts, in much of the literature,
where, essentially, the problem is presented as that of
determining an efficient set in the space of expectations of various
"criteria."*)
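
A small numerical sketch may make the point concrete. The exponential
form of the benefit function, its decay rate, and the three delay
distributions are assumptions chosen for illustration only; they do
not come from the text or from Appendix I.

```python
import math


def expectation(distribution, f=lambda x: x):
    """Expected value of f(delay) for a list of (delay, probability) pairs."""
    return sum(p * f(delay) for delay, p in distribution)


def benefit(delay, decay=0.5):
    """Hypothetical benefit that shrinks exponentially as results arrive later."""
    return math.exp(-decay * delay)


steady = [(2.0, 1.0)]                # always two time units late
erratic = [(0.0, 0.5), (4.0, 0.5)]   # same expected delay as `steady`
slower = [(0.0, 0.5), (5.0, 0.5)]    # larger expected delay than `steady`

for name, dist in [("steady", steady), ("erratic", erratic), ("slower", slower)]:
    print(name, expectation(dist), round(expectation(dist, benefit), 4))
```

Under these assumptions `steady` and `erratic` have equal expected
delay but unequal expected benefit, and `slower` has a larger expected
delay than `steady` together with a larger expected benefit. Expected
delay alone therefore does not order the systems.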

*) The clearest formulation of such an efficient set is given
by Wolfowitz [1961], in the context of optimal coding for a
noisy channel. It seems that the assumption of utility linear
in its criteria is implicit in the discussion of optimal
design in many fields of engineering. See, e.g., English [1968].
END OF FOOTNOTE

7.3 Minimum expected length of code word, as the "uncertainty at
source." If only two possible messages z (= 1 or 0, say) can be sent,
each can be encoded as a single binary digit, to be transmitted through
the channel. However, if a time sequence of T such two-valued messages
is to be communicated, less than T digits (and hence less than one
digit per message) will be needed on the average if one uses "code words"
(binary sequences) with few digits for the more probable and with more
digits for the less probable sequences of messages. For example, if one
uses this principle and if the odds for z taking its two values are
9:1, then, even if the sequences of messages occur independently ("have
no pattern"), it is possible to devise codes which will use, on the
average, approximately only .64 or .53 digits per message when T = 2
or T = 3, respectively. In general, as established by "Shannon's
first theorem," the minimum expected length of the code word per
message decreases as T increases, and it converges towards the (never
negative) quantity

(7.3.1)  $$-\sum_{z \in Z} \pi_z \log_2 \pi_z = H(\pi), \text{ also written as } H(Z).$$

This limit is valid not only for the case of two-valued messages (as in
our example, with H(π) = .47) but for a set Z of any size m. Since
H(π) is largest when all the m elements of Z are equiprobable [so
that every π_z = 1/m, and H(π) = log₂ m], the name "amount of uncertainty"
(about z) occasionally given to H(π) is indeed a suggestive one.
Alternatively, one says that H(π) units of information are gained if
this uncertainty is removed (by learning the actual value of z). Indeed
H(π) has been proposed as a "measure" of uncertainty, or of information,
because it is additive in the following sense. Let π', π" characterize
two statistically independent sets Z' and Z"; that is, the joint
occurrence z = (z' and z") of given messages from the two sets occurs with
probability

$$\pi_z = \pi'_{z'} \cdot \pi''_{z''} ;$$

then, by the definition (7.3.1) of the distribution parameter H,

(7.3.2)  $$H(\pi) = H(\pi') + H(\pi'').$$

Similar additivity properties are derived for certain related distribution
parameters (such as "uncertainty removed by transmission," of which more
later). Since H(π) measures the average length of a sequence of binary
digits, the measurement unit of "uncertainty" (or its negative,
"information") is called, briefly, a bit, following a suggestion of J. W. Tukey.
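
The figures quoted in the example (.64 and .53 digits per message,
and H(π) = .47) can be reproduced with a short sketch. It assumes
that the code is a binary Huffman code built over blocks of T
independent messages; the text does not say which code was used, so
the choice of Huffman coding and all function names are assumptions.

```python
import heapq
import itertools
import math


def huffman_lengths(probs):
    """Code-word length of each symbol under a binary Huffman code."""
    lengths = [0] * len(probs)
    # Heap entries: (probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie_breaker = itertools.count(len(probs))
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        for s in symbols1 + symbols2:
            lengths[s] += 1  # every merge adds one digit to these code words
        heapq.heappush(heap, (p1 + p2, next(tie_breaker), symbols1 + symbols2))
    return lengths


def digits_per_message(p_one, T):
    """Expected Huffman code length per message for blocks of T messages."""
    blocks = itertools.product([0, 1], repeat=T)
    probs = [math.prod(p_one if b else 1 - p_one for b in block) for block in blocks]
    lengths = huffman_lengths(probs)
    return sum(p * n for p, n in zip(probs, lengths)) / T


def entropy(probs):
    """H(pi) in bits, as in (7.3.1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


for T in (1, 2, 3):
    print(T, round(digits_per_message(0.1, T), 3))
print("limit H(pi):", round(entropy([0.9, 0.1]), 3))
```

Under these assumptions the expected length per message falls from 1
digit at T = 1 towards the limit H(π) as T grows.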

It is not clear, however, for what economic purpose one should
measure uncertainty, or information. Because of the additive property
(7.3.2) of the distribution parameter H, specialists in various fields
(mathematics, statistics, psychology) expressed enthusiasm: the subtle,
intangible concept of information has now become measurable "in a way
similar to that as money is used in everyday life" (Renyi [1966]).

Indeed a paper currency bill can be measured by the number of dollars it
represents, and thus by the amount of some useful commodity at a given
price. But it can be also measured (a peso and a hundred peso bill
alike) in square inches of its area. If I use it for papering my walls,
the latter not the former measurement is appropriate!




Somewhat anticipating the subsequent more detailed discussion,
note that a distribution parameter such as H(π)
cannot alone determine the information value of a system. For H(π)
depends only on the distribution π, not on the benefit function β. To
be sure, the special assumption (7.1.1) of equal penalty for all
communication errors does remove variations of the benefit function. This
fact may have been the source of misunderstandings about the economic
significance of the number of bits gained or lost, regardless of the
use the decision-maker can make of them. If a general fidelity
criterion (presumably reflecting the decision-maker's needs) is introduced,
H(π) fails to determine the information value of the system.

What is economically important about H(π) is its meaning as the
lower limit of the expected length of a code word, given the distribution
π. For, the shorter a code word the less is, presumably, the time needed
to transmit it, digit by digit; and therefore, for reasons just stated
in Section 7.2, the larger the expected benefit.*)

*) Wolfowitz [1961] writes that the function H should

"for convenience and brevity have a name. However, we shall draw
no implicit conclusions from its name, and shall use only such
properties of H as we shall explicitly prove. In particular,
we shall not erect any philosophical systems on H as a
foundation. One reason for this is that we shall not erect any
philosophical systems at all, and shall confine ourselves to the proof
of mathematical theorems,"

namely, theorems on optimal coding. The present writer, though guided by
economic rather than mathematical interest, tends to agree.


On the other hand, note that, to bring the expected length of code
words down close to its lower limit, H(π), one may have to wait till
a very long sequence of messages (T large) is piled up. The resulting
delay may offset the acceleration due to the shortening of code words.

In addition, there are storage costs.

We can now refer to the "four-link" chain (b) of Section 2.2.
Messages to be sent are stored, encoded, received, and decoded. The
benefit (fidelity criterion) depends on the messages to be sent and on
the decoded messages; the expected benefit will depend on the probability
distribution π characterizing the source (i.e., the messages to be
sent) and on the Markov matrices characterizing consecutive processings.
Costs and delays arise at each processing link, and their distribution
(and hence expectation) depends, too, on π and those Markov matrices.

We have, however, just remarked that the four-link chain is merely
a part of the total information system, in which benefit depends on
events and actions. Events are transformed, by inquiry, into
observations ("data"). These are the messages to be sent, the initial input
of the communication system; and its final output, the decoded messages,
are transformed into actions by applying strategies. We have thus added
two links, one at each end of the communication chain. It remains true
that the probability distribution (and hence the expectation) of
benefits, costs, and delays depends on the initial distribution π
(now attached to events, not to messages received) and on the successive
Markov matrices.

We can also regard a communication system as a special case of the
general information system; viz., one in which the processing of events
into data and the processing of decoded messages into action are
characterized by identity transformations and by zero costs and zero delays.

7.4 Noisy channel: transmission rate and capacity. To concentrate
on the properties of a channel, it will be convenient to reinterpret our
notational symbols again. Let us now designate channel inputs by z in
Z, and its outputs by y in Y, analogous to the "events" and
"observations" of Sections 3-5. Channel inputs z, the digits of the encoded
message, occur with probabilities π_z. Channel outputs, y, the digits
received at the channel's end, occur, for a given z, with conditional
probabilities p(y|z) = η_zy, elements of the Markov matrix η, called
the channel matrix. The channel is noiseless if η is the identity
matrix. The joint probability of z and y and the marginal probability
of y are, respectively (see footnote to Section 5.4, on notations),

$$p(z,y) = \pi_z \eta_{zy}, \qquad p(y) = \sum_{z \in Z} \pi_z \eta_{zy}.$$

It will be convenient to give a special symbol, δ_yz (an element of the
Markov matrix δ = [δ_yz]), to the posterior probability of z, given y.
Clearly δ depends on π and η:

$$\delta_{yz} = p(z|y) = p(z,y)/p(y) = \pi_z \eta_{zy} \Big/ \sum_{u \in Z} \pi_u \eta_{uy}.$$

It is customary to call "uncertainty about z, retained after digit y
was received through the channel", the expression

$$H(Z|y) = -\sum_{z} \delta_{yz} \log_2 \delta_{yz} ,$$

and to call its expectation

(7.4.1)  $$\sum_{y} p(y)\, H(Z|y) = H(Z|Y)$$

the "uncertainty retained," in L. Breiman's [1960] suggestive language.

It is clear from its definition that H(Z|Y) depends only on the
probability distribution π and η, and we want to emphasize this by writing
occasionally

$$H(Z|Y) \equiv J(\pi,\eta).$$

The quantity (never negative)

(7.4.2)  $$H(Z) - H(Z|Y) \equiv I(Z,Y) \equiv I(Y,Z)$$

has been called "uncertainty removed" or "amount of information transmitted."
Because of the symmetry with respect to Z, Y, which is easily shown, it
has also been called "mutual information."*) Clearly, it depends on π

*) H. Theil [1967], [1968], uses the difference H(Z) - H(Z|Y) to measure,
for example, the discrepancy between the predicted and the actual
composition of a balance sheet, national income, or some other total. Of
course, this measure can be used outside of economics as well; and it is
related to information mainly because the same formula has been used in
the theory of communication as developed by C. Shannon and others. This
explains the difference in content between Theil's studies and those
presented here, in spite of the similarity of titles.


END OF FOOTNOTE

and η only, and it will be convenient to write

(7.4.3)  $$H(\pi) - J(\pi,\eta) \equiv K(\pi,\eta),$$

say. Shannon's "generalized first theorem" states that K(π,η) is the
lower limit of the expected number of binary digits needed to identify
(by appropriately decoding the digit sequence received) each digit put
through the channel. Thus K(π,η) is measured by a number of bits, divided
by the number of digits put through the channel.
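
The quantities J(π,η) and K(π,η) follow directly from the definitions
of this Section. The sketch below is only an illustration: the
function names and the numerical values of `pi` and of the three
channel matrices are assumptions, not taken from the text.

```python
import math


def entropy(probs):
    """H in bits of a probability vector, as in (7.3.1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def uncertainty_retained(pi, eta):
    """J(pi, eta) = H(Z|Y): expectation over y of the entropy of the posterior."""
    total = 0.0
    for y in range(len(eta[0])):
        p_y = sum(pi[z] * eta[z][y] for z in range(len(pi)))
        if p_y == 0:
            continue  # an output that never occurs contributes nothing
        posterior = [pi[z] * eta[z][y] / p_y for z in range(len(pi))]
        total += p_y * entropy(posterior)
    return total


def information_transmitted(pi, eta):
    """K(pi, eta) = H(pi) - J(pi, eta), in bits per channel digit."""
    return entropy(pi) - uncertainty_retained(pi, eta)


pi = [0.9, 0.1]
noiseless = [[1.0, 0.0], [0.0, 1.0]]
noisy = [[0.8, 0.2], [0.3, 0.7]]
useless = [[0.5, 0.5], [0.5, 0.5]]   # output independent of input

for name, eta in [("noiseless", noiseless), ("noisy", noisy), ("useless", useless)]:
    print(name, round(uncertainty_retained(pi, eta), 3),
          round(information_transmitted(pi, eta), 3))
```

For the noiseless matrix the sketch gives J = 0 and K = H(π); for the
matrix whose rows are identical it gives K = 0.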

In Section 7.2, the speed of a channel, v digits per time unit (say),
was introduced. If we multiply it by K(π,η) bits per digit, we obtain

(7.4.5)  v (digits/time) × K(π,η) (bits/digit) = v·K(π,η) (bits/time),

a quantity called transmission rate. Some confusion is created in
textbooks, though certainly not in engineering practice, by choosing the time
unit so as to make v = 1 for convenience, and not stating this very
explicitly. Yet the distinction between "uncertainty removed" and
"uncertainty removed per time unit" is of economic importance. If
(though not only if: see Section 7.6 below) transmission causes garbling,
in the formal sense of our Section 4.2, the number K(π,η) of bits per
digit decreases.*) Thus variations of "uncertainty removed" can affect

*) It is easily seen that, in fact, when the channel is noiseless (i.e.,
η is an identity matrix) then J(π,η) = 0, K(π,η) = H(π). That is,
for given π, uncertainty retained is at its minimum, and uncertainty
removed reaches its maximum, when the channel is noiseless.

expected benefit because of possible garbling. But another factor
affecting expected benefit is the delay in transmission. An accurate
but slow transmission may have the same value to the user as an inaccurate
but fast one.

By (7.4.5) the transmission rate depends on v, π, and η. If v
and η are kept constant but π varies over the set of all probability
vectors of order m, the transmission rate will vary, and its maximum
is called the capacity of the channel. It depends on v and η (and
thus also on m, for η is of order m × n). However, in theoretical
discussion v is usually put = 1, making the capacity, denoted by C,
depend on η (and thus m) only. In this notation we have, for any v,

$$\max_{\pi} K(\pi,\eta)\, v = C(\eta) \cdot v \text{ bits per time unit.}$$

7.5 Capacity and cost. It can be presumed that the cost of a channel
increases with v. It is also usually assumed, I think, that channel
cost increases with C(η). This assumption was used in
Section 6.7, where a formula for C(η) was given for η binary and
v = 1. However, it is not too clear why two channels with two different
matrices η, η' should require equal costs (of construction,
maintenance and operation) whenever C(η) = C(η'). For example, formula (6.7.1)
yields approximately (see Figure 4)

$$C\begin{pmatrix} .83 & .17 \\ .17 & .83 \end{pmatrix} = .3 = C\begin{pmatrix} .5 & .5 \\ 0 & 1 \end{pmatrix}.$$

The matrix on the right is exemplified by a channel which transmits every
"no" without fault, but transforms a "yes" into a "no" half of the time:
"You will send me a word (through a very unreliable messenger) only if
you decide to come." It is not clear why the use of such a channel
should equal in cost the use of a somewhat more reliable messenger who
mistakes a "yes" for a "no", or conversely, about one time out of six,
as in the matrix on the left. I suppose data on such cost questions are
at the disposal of the communication industry. As far as I can see,
theoretical literature does, in effect, regard all channels with equal
capacity as equivalent with respect to cost. It answers, for example,
the question: "What is the best code for a channel with a given
capacity?" Yet, the user's economic question should be: "What is it for
a channel with a given cost?"
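
The approximate equality of the two capacities can be checked by
maximizing K(π,η) over input distributions. The sketch below uses a
plain grid search rather than formula (6.7.1), which is not reproduced
in this Section. The two matrices are my reading of the example (a
symmetric channel with error rate .17, and a channel that delivers
every "no" correctly but turns a "yes" into a "no" half of the time),
with row 0 standing for "yes". Function names are hypothetical.

```python
import math


def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_transmitted(pi, eta):
    """K(pi, eta) = H(Z) - H(Z|Y), in bits per channel digit."""
    retained = 0.0
    for y in range(len(eta[0])):
        p_y = sum(pi[z] * eta[z][y] for z in range(len(pi)))
        if p_y > 0:
            posterior = [pi[z] * eta[z][y] / p_y for z in range(len(pi))]
            retained += p_y * entropy(posterior)
    return entropy(pi) - retained


def capacity(eta, steps=1000):
    """C(eta) for a two-input channel, by grid search over pi = (p, 1 - p)."""
    return max(
        information_transmitted([i / steps, 1 - i / steps], eta)
        for i in range(steps + 1)
    )


symmetric = [[0.83, 0.17], [0.17, 0.83]]   # errors in both directions
one_sided = [[0.5, 0.5], [0.0, 1.0]]       # only "yes" is ever garbled

print(round(capacity(symmetric), 3), round(capacity(one_sided), 3))
```

Both values round to .3 bits per digit, although the two matrices
describe quite different channels.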

7.6 Does informativeness always increase with "information
transmitted"? The answer is no. Let φ be any convex function of a
non-negative variable. One such function is

(7.6.1)  $$\varphi_0(x) = x \log_2 x, \quad 0 \le x \le 1,$$

since $\varphi_0''(x) = 1/(x \ln 2) > 0$. The following has been proved:*)

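Theorem. If η^(1) = [η^(1)_{z y^(1)}] and η^(2) = [η^(2)_{z y^(2)}]
are two information matrices then η^(1) > η^(2) if
and only if, for any convex function φ,

(7.6.2)  $$\sum_{y^{(1)}} p(y^{(1)}) \sum_{z} \varphi(\delta_{y^{(1)} z}) \;\ge\; \sum_{y^{(2)}} p(y^{(2)}) \sum_{z} \varphi(\delta_{y^{(2)} z}) ,$$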

*) Blackwell and Girshick [1954], part ... of Theorem 12.2.2; DeGroot.

where, as in Section 7.4, p(y^(k)) and δ_{y^(k) z} are, respectively, the
marginal probability of y^(k) and the posterior probability of z,
given y^(k); both depend on the distribution π and η^(k). Consider
now the particular convex function φ₀ defined in (7.6.1). By the
definitions of Section 7.4,

$$\sum_{y} p(y) \sum_{z} \varphi_0(\delta_{yz}) = -J(\pi,\eta) ,$$

where η = [η_zy]. It follows from the above theorem that

$$J(\pi,\eta^{(1)}) \le J(\pi,\eta^{(2)}) \quad \text{if } \eta^{(1)} > \eta^{(2)} ;$$

it also follows that the converse is not true since the theorem requires
condition (7.6.2) to hold for all convex functions and not just for φ₀.
It further follows, by (7.4.3), that the condition

$$K(\pi,\eta^{(1)}) \ge K(\pi,\eta^{(2)})$$

is necessary but not sufficient for η^(1) to be more informative than
η^(2). This means that there exist distributions π and benefit functions
(fidelity criteria) β such that an increase in K, the information
transmitted, can be consistent with a decrease in the expected benefit.
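
A concrete instance of this last statement can be built in a few
lines. The sketch below is an illustration of my own, not an example
from the text. Each of four equiprobable messages is a pair (high bit,
low bit), and the benefit function penalizes only an error in the high
bit. The `coarse` channel delivers the high bit alone, without error.
The `detailed` channel delivers the low bit without error and the high
bit with a 10 per cent chance of reversal.

```python
import math


def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_transmitted(pi, eta):
    """K(pi, eta) = H(Z) - H(Z|Y), in bits per channel digit."""
    retained = 0.0
    for y in range(len(eta[0])):
        p_y = sum(pi[z] * eta[z][y] for z in range(len(pi)))
        if p_y > 0:
            posterior = [pi[z] * eta[z][y] / p_y for z in range(len(pi))]
            retained += p_y * entropy(posterior)
    return entropy(pi) - retained


def best_expected_benefit(pi, eta, beta, actions):
    """Expected benefit when every output y is decoded into its best action."""
    return sum(
        max(
            sum(pi[z] * eta[z][y] * beta(a, z) for z in range(len(pi)))
            for a in actions
        )
        for y in range(len(eta[0]))
    )


def high_bit_only(a, z):
    """Hypothetical fidelity criterion: only the high bit of z = 2*high + low matters."""
    return 0.0 if a // 2 == z // 2 else -1.0


pi = [0.25, 0.25, 0.25, 0.25]
coarse = [[1, 0], [1, 0], [0, 1], [0, 1]]
detailed = [
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.1, 0.0, 0.9],
]

for name, eta in [("coarse", coarse), ("detailed", detailed)]:
    print(name,
          round(information_transmitted(pi, eta), 3),
          round(best_expected_benefit(pi, eta, high_bit_only, range(4)), 3))
```

Under these assumptions the detailed channel transmits about 1.53 bits
per digit against 1 bit for the coarse one, yet its best attainable
expected benefit is -.1 against 0.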

7.7 Efficient coding, given a fidelity (benefit) function. Let
us continue with the notations of Section 7.4. A channel is characterized
by speed v, and by a Markov matrix η, which transforms channel inputs
z in Z (occurring with probabilities π_z) into channel outputs y
in Y. Now, the channel is a processing link intermediary between two
others. On the one hand, at its exit, outputs must be decoded; and, as
before, we identify, in the context of communication theory, the results
of decoding (decoded messages) with benefit-relevant actions a in A
(where the sets A and Z are identical), and, hence, identify the decoding
transformation with the strategy α. On the other hand, the
benefit-relevant events are not the channel inputs but the messages to be sent.
These are transformed into channel inputs by a processing called encoding,
possibly preceded by storing, as indicated in Section 7.3. Neglect
storing for a moment (i.e., assume it to be characterized by identity
transformation, zero costs and zero delays) and denote the messages to
be sent by s in S, and their probability distribution by σ. Denote
by ν the speed of inflow of these messages. An encoding Markov
matrix ε (possibly noiseless) transforms S into Z; and clearly σ
and ε completely determine the distribution π on Z. To be feasible,
an encoding matrix ε is conditioned on some costs and delays, as is
the decoding matrix α. These costs and delays are presumably increasing
with the length of code words, and also with the number of code words
(size of "dictionary"). The pair (ε,α) is called code.

Given σ and the benefit (fidelity) function β on A × S, we
can express the expected benefit thus, analogous to (3.6.3):

$$B_{\sigma\beta}(\eta,\varepsilon,\alpha) = \sum_{s \in S} \sum_{a \in A} \beta(a,s)\, \sigma_s \sum_{z} \sum_{y} \varepsilon_{sz}\, \eta_{zy}\, \alpha_{ya} .$$

If the channel is noiseless, its matrix η is an identity matrix, 1.
Write

$$\max_{\varepsilon,\alpha} B_{\sigma\beta}(1,\varepsilon,\alpha) = B^{\max}_{\sigma\beta} ,$$

where the maximization on the left side is over all pairs (ε,α). The
notation on the right side is justified if it is proved that

$$B^{\max}_{\sigma\beta} \ge B_{\sigma\beta}(\eta,\varepsilon,\alpha) \quad \text{for all } \eta, \varepsilon, \alpha .$$

In this notation, Shannon's "second theorem," generalized for the case
of any fidelity criterion*) rather than the special one of (7.1.1), can
be stated thus:

For any positive k, and given σ, β, η, there exists a
code (ε*,α*) such that

$$B^{\max}_{\sigma\beta} - B_{\sigma\beta}(\eta,\varepsilon^*,\alpha^*) \le k ,$$

provided H(σ)·ν < C(η)·v.

The left side of the upper inequality becomes the "probability of error"
p_e in the special case when all errors are assigned equal penalty as
in (7.1.1). The theorem is then reduced to its original formulation.
(In addition, the speeds ν and v are often taken to be equal.)

The code suggested by Shannon [1960] to prove the generalized second
theorem is of a particular form, in two respects:

(1) the encoding consists of two steps, first transforming each
message to be sent, s, into what may be called "appropriate action under
certainty," a_s in A, such that β(a_s,s) = max_{a∈A} β(a,s), and then
transforming each a_s into a channel input, another element of A;

(2) the second of these two steps is the same for any two channel
matrices η, η' with C(η) = C(η').

*) See Shannon [1960], Jellinek [1968], Pham [1968].




However, with or without the particular restriction presented by
Shannon's double coding, the code (ε*,α*) may require, for k small,
long code words and waiting for long sequences of messages to be sent.
As discussed earlier (Sections 7.2, 7.3), long code words cause delays.
Long sequences presuppose storing. Therefore, to realize a code (ε*,α*)
for a small k, it is not possible to neglect (as we have done at the
beginning of this Section) the storage of messages that must precede
their encoding. And this introduces additional delays.

7.8 Demand for communication links. The cost of each processing
link (storing, encoding, transmitting, decoding) will depend on the
characteristics of its transformation matrix but it may also, in general,
vary with its inputs, as in Section 1.1. Thus the expected cost of
encoding will depend on the probabilities σ_s of the various messages to be
encoded; the expected cost of transmitting will depend on the σ_s ε_sz;
and that of decoding, on the σ_s ε_sz η_zy. And similarly with expected
delays. This is simplified if, as in Section 6.6, the cost and delay of
each processing depends on the transformation characterizing it (ε, η, α)
but not on the input; and if the same is assumed of delays. The sum of
costs of the links is then subtracted from the expected benefit; and the
latter is affected by the delays in the several links, especially because
of the diminution of expected benefit caused by the obsolescence of actions
(here: decodings), as in Section 2.

However, most of the existing literature lets each link be associated,
not with its costs and delays, but with characteristics such as channel
capacity, length of the code word, and size of the code dictionary. A
question such as the following is asked:

given the channel capacity, the (expected) word length,
and the code size, how large an expected fidelity can
be achieved?*)

Answering such a question would not really provide the set of
communication systems efficient from the point of view of a given user,
characterized by a fidelity function and a probability distribution of
messages to be sent. We remarked in Section 7.5 that two channels with
equal capacity (and speed) need not have equal cost. As to the length
(or more generally, the expected length) of code words, it matters
because of the delays it causes; and these influence expected utility
to the user, not by being added to costs but through a complicated
effect on expected benefit, especially by making decisions obsolete, as
we have just remarked. Expected utility cannot be decomposed additively
into expected benefit, channel capacity and (expected) word length; that
is, utility is not linear in these quantities. (Similar considerations
would apply to the size of code.) Yet without such additivity answers to
a question like the one just formulated would not provide the set
efficient from a given user's point of view (see Appendix I).

In a sense, the set of non-dominated quadruples (expected fidelity,
channel capacity, expected word length, code size) is the result of a

*) This is the formulation given by Wolfowitz [1961], but generalized in two
respects: by introducing a general fidelity criterion instead of an
equal penalty for all errors; and by permitting the code words to vary
in length, thus presumably increasing coding efficiency. I must
acknowledge a great debt to Wolfowitz's clear presentation of the economic
problem.

crude "averaging" over all users. Delays, being undesirable for all
users, are replaced by what amounts to an additive cost, as a make-do.
This gives a rough guidance to the supplier of the communication links
in estimating the demand for them. The demand of the individual user
(if he is "rational") is rather different, and hence that crude average
cannot represent the aggregate demand.


8. MARKET FOR INFORMATION

8.1 Demand for systems and sub-systems. Return now to the general
outline of purposive processing chains (and networks, for that matter)
that we gave in Sections 1 and 2, with especial regard to information
systems. The individual user (meta-decider) can achieve a given sequence
of transformations only at certain costs and with certain delays (or,
more generally, a certain probability distribution of costs and delays).
Subject to these constraints, he should maximize the expected benefit
simultaneously with respect to all of the transformations. Just like
an ideal plant designer decides simultaneously about the size and
composition of the personnel as well as of the machine park, the warehouses
and the transportation facilities! This is, of course, hardly ever
achieved in reality.*) The humbler meta-decider makes his choices
separately for each of several sub-systems; this is what the term
"sub-optimization" is often intended to mean, I believe. Hopefully, he
partitions the total system in such a way that the complementarity
between sub-systems (with regard to expected benefit) is small.

The failure to maximize over all system components simultaneously
is just one of many allowances for "lack of rationality" that must be
made before we claim a modicum of descriptive validity to the result
of aggregating the demands of individual users into the total demand for
system components of various kinds, given the constraints.


*) For an attempt to deal more formally with the limitations of the
meta-decider ("organizer") see Marschak and Radner [in press], Chapter 9.

8.2 The supply side. The "demand side" of the market, the relation
associating the set of constraints with the set of demands, depends on
the benefit functions β and the probability distributions π
characterizing individual users. The "supply side" is the relation between
the constraints and the supplies, and depends on the "production conditions"
("technology") characterizing each supplier. As usual, the economist
is almost completely ignorant of technology.

Let me conclude just with three, rather casual, remarks on these
production conditions. It is superfluous to remind the economist that
the market is supposed to equalize demand and supply, and the demand and
supply constraints.

8.3 Standardization. In many cases, it does not pay to produce
"on order." Mass production may be cheaper. This may explain why our
Sunday newspaper is so bulky (it gives all things to all subscribers),
and why our telephones have such a high fidelity. The individual user
is "forced" to purchase information services which, for him, would be
wasteful if they were not so cheap.

8.4 Packaging. In our scheme, inquiry was presented as a component
separate from storing the data, encoding as separate from transmission,
etc. The producer of automata and control mechanisms may find it cheaper
to produce them jointly, in fixed "packages." This, again, imposes
constraints on the user, similar to those of standardization.

Standardization and packaging are, of course, not peculiar to the production of information services and are present in other markets. I would be grateful for references, especially to writings of a more formal kind.
8.5 Man vs. machine. The competition between machines and human nerves (not muscles) is much discussed today. Some symbol-manipulating services consist in many-to-one mapping, variously called "sorting" and "pattern-recognition." Encoding and decoding are of this nature, but not the (generally noisy) transmission. To be sure, we have, in Section 7, characterized encoding and decoding by Markov matrices, thus allowing for "randomized codes." Such codes have been used for the convenience of mathematical proofs. But, as in any one-person game, there exists an optimal non-randomized choice. Except to allow for (non-rational) error-making encoders and decoders, we may as well consider these activities as many-to-one mappings. In particular, let us consider the "double encoding" proposed by Shannon [1960] and referred to in our Section 7.7. We can imagine the encoder to partition a set of visible or audible stimuli, including verbal sentences, into equivalence classes, variously called "patterns" and "meanings." These are translated, in turn, into the language of channel inputs and outputs, and then decoded back into "patterns" or "meanings." As a special case, we may be little concerned with transmission noise (newspaper misprints or slips of the tongue).

With the channel assumed noiseless, the inefficiency of double encoding is removed. The problem of the best code remains: what is the best way to make the receiver (a listener or reader, for example) "understand" the sender (a lecturer or writer)? The sender must encode into a well-chosen set of patterns (an "effective style" of speech or writing, for example), such that the receiver would be able to recognize them, and respond to them by benefit-maximizing actions.
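The remark that a one-person problem always admits an optimal non-randomized code can be checked numerically. The sketch below is only an illustration under assumed numbers: the prior over messages and the benefit table are hypothetical, and a "code" is represented as a Markov matrix giving the probability of each symbol for each message. Because expected benefit is linear in the entries of that matrix, no randomized code should beat the best deterministic one.

```python
import itertools
import random

# Hypothetical data (not from the text): three messages, two code symbols.
prior = [0.5, 0.3, 0.2]            # assumed probabilities of the messages
benefit = [[4.0, 1.0],             # benefit[m][s]: assumed payoff of emitting
           [0.0, 3.0],             # symbol s when the message is m
           [2.0, 2.5]]

def expected_benefit(code):
    """code[m][s] = probability of symbol s given message m (a Markov matrix)."""
    return sum(prior[m] * code[m][s] * benefit[m][s]
               for m in range(len(prior))
               for s in range(len(benefit[m])))

def deterministic_codes():
    """Enumerate every many-to-one mapping as a 0/1 Markov matrix."""
    n_sym = len(benefit[0])
    for choice in itertools.product(range(n_sym), repeat=len(prior)):
        yield [[1.0 if s == c else 0.0 for s in range(n_sym)] for c in choice]

def random_code(rng):
    """Draw an arbitrary randomized code (each row sums to one)."""
    rows = []
    for _ in prior:
        p = rng.random()
        rows.append([p, 1.0 - p])
    return rows

best_deterministic = max(expected_benefit(c) for c in deterministic_codes())
rng = random.Random(0)
best_randomized_seen = max(expected_benefit(random_code(rng)) for _ in range(1000))

print(best_deterministic, best_randomized_seen)
```

In this toy setting the best many-to-one mapping simply picks, for each message, the symbol with the highest benefit, and the sampled randomized codes never exceed it.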


We are told by psycholinguists, e.g., Miller [1967], that man's effectiveness as a channel (and also as a storage facility) is poor compared with inanimate equipment such as telephones (and record tapes). But his coding ability seems superb in many cases. It is variously called "insight," "judgment," "ability to recognize a Gestalt (pattern)
APPENDIX I: Requirement of Commensurable Criteria.

In the text, utility was defined on each pair "event, action." It is sometimes useful to introduce an additional concept, the result r (also called consequence) of the given pair "event, action", and to define utility as a function of the result. The result need not be numerical. For example, the result's values can be "getting cured; dying; continuing in ill health." When the result is a numerical vector, and utility is monotone increasing in each of its components, we call each component a (desirable) criterion.*)

*) In fact, a suggestion has been made to replace the commodity space of usual economic theory by a space of criteria that may "explain" the consumers' preferences: e.g., a car becomes a bundle of criteria such as speed, mileage per gallon of fuel, etc.; see Lancaster [1966].

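Thus

$$
\begin{aligned}
&\text{action} = a; \quad \text{event} = z;\\
&\text{result } r = (r_1,\dots,r_n), \text{ with every } r_i \text{ numerical};\\
&r_i = \rho_i(a,z) \quad (i\text{-th ``result function''});\\
&\text{utility } u = v(r_1,\dots,r_n);\\
&v(r_1,\dots,r_n) > v(r_1',\dots,r_n') \ \text{ if } \ r_i > r_i' \text{ for some } i,\quad r_i \ge r_i' \text{ for all } i.
\end{aligned}
$$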

Consider a case when n = 1: suppose, e.g., the decision-maker maximizes the expected utility of money profit. The unique component of the criterion vector is then a dollar amount. It is well known that, in this case, expected utility is not necessarily monotone in expected money profit (independently of some other parameters of the distribution of money profit such as variance) unless utility is linear in money.
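A small numerical illustration of the n = 1 case may help. The two profit distributions and the square-root utility below are assumptions chosen for the example, not taken from the text; the square root merely stands in for any utility that is increasing but not linear in money.

```python
import math

def expectation(lottery, f=lambda x: x):
    """Expected value of f(profit) for a lottery given as (profit, probability) pairs."""
    return sum(p * f(x) for x, p in lottery)

# Hypothetical money-profit distributions yielded by two actions.
safe = [(100.0, 1.0)]                 # 100 dollars for certain
risky = [(0.0, 0.5), (225.0, 0.5)]    # higher mean, much higher variance

mean_safe, mean_risky = expectation(safe), expectation(risky)

# Assumed concave utility of money: u(x) = sqrt(x).
eu_safe = expectation(safe, math.sqrt)
eu_risky = expectation(risky, math.sqrt)

print("expected profit :", mean_safe, mean_risky)
print("expected utility:", eu_safe, eu_risky)
```

Here the risky action has the larger expected profit but the smaller expected utility, so ranking actions by expected profit alone would mislead this decision-maker; with a utility linear in money the two rankings would coincide.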

Before we generalize to the case of n components, note, as an example, that the pair "minus cost, numerical benefit" constitutes a vector consisting of two criteria. In Section 7.8 the following criteria, used in communication theory, were listed: fidelity criterion; length of code word; size of code; capacity of channel (provided of course that the last three numbers be replaced by their negatives).

Given the distribution $\pi$ of events $z$, the action $a$ will result in some joint distribution of $r_1,\dots,r_n$, to be denoted $\pi_a(r_1,\dots,r_n)$.

Consequently, action $a$ will yield expected utility

$$E_a(u) = \sum_{r_1,\dots,r_n} v(r_1,\dots,r_n)\,\pi_a(r_1,\dots,r_n). \tag{A.1}$$


Given the action $a$, and thus the joint distribution $\pi_a$, the marginal probability distribution of a particular criterion, for example of $r_1$, will be denoted by

$$\pi_a(r_1) = \sum_{r_2,\dots,r_n} \pi_a(r_1,\dots,r_n);$$

no ambiguity results from using the same symbol, here $\pi_a$, for two different functions, made distinguishable by their different arguments, in parentheses.

Then the expected value of $r_i$, given action $a$, is

$$E_a(r_i) = \sum_{r_i} r_i\,\pi_a(r_i). \tag{A.2}$$

The vector of expected criterion values will be denoted by

$$[E_a] = [E_a(r_1),\dots,E_a(r_n)].$$
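The quantities just defined (expected utility, marginal distribution, expected criterion values) can be sketched in a few lines. The joint distribution below is a made-up example for a single action with two criteria, say "minus cost" and "benefit"; the function names are illustrative only.

```python
from collections import defaultdict

# Assumed joint distribution pi_a(r1, r2) for one action a:
# keys are result vectors (r1, r2), values are probabilities.
pi_a = {(-1.0, 4.0): 0.25, (-1.0, 0.0): 0.25, (-3.0, 10.0): 0.5}

def expected_utility(joint, v):
    """Expected utility as in (A.1): sum of v(r) * pi_a(r) over all result vectors r."""
    return sum(p * v(r) for r, p in joint.items())

def marginal(joint, i):
    """Marginal distribution of the i-th criterion (0-based index)."""
    m = defaultdict(float)
    for r, p in joint.items():
        m[r[i]] += p
    return dict(m)

def expected_criterion(joint, i):
    """Expected value of the i-th criterion as in (A.2), computed from its marginal."""
    return sum(r_i * p for r_i, p in marginal(joint, i).items())

def expectation_vector(joint):
    """The vector [E_a] of expected criterion values."""
    n = len(next(iter(joint)))
    return [expected_criterion(joint, i) for i in range(n)]

E_a = expectation_vector(pi_a)
# With an additive utility, expected utility equals the sum of the expected criteria.
eu_additive = expected_utility(pi_a, lambda r: sum(r))
print(E_a, eu_additive)
```

With the additive utility used in the last lines, the expected utility coincides with the sum of the entries of the expectation vector, which is the "expectation of a sum" identity used further on in the appendix.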

Given two actions $a$ and $b$, we say, as usual, that $[E_a]$ dominates $[E_b]$, and write $[E_a]$ dom $[E_b]$, if

$$E_a(r_i) \ge E_b(r_i), \ \text{all } i; \qquad E_a(r_i) > E_b(r_i), \ \text{some } i.$$

We shall then also say that action $a$ dominates $b$ with respect to criterion expectations.

Suppose that

$$E_a(u) > E_b(u) \quad \text{whenever} \quad [E_a] \ \text{dom} \ [E_b]. \tag{A.3}$$

Clearly this is equivalent to saying that expected utility $E_a(u)$ is a monotone increasing function of the expected criterion values $E_a(r_1),\dots,E_a(r_n)$. If this is the case then, and only then, the feasible action $a^*$ (say) that maximizes each of the $E_a(r_i)$ will also maximize $E_a(u)$.
Suppose the utility function $v$ is not known; but condition (A.3), or, equivalently, the monotonicity of $E_a(u)$ with respect to the criterion expectations $E_a(r_1),\dots,E_a(r_n)$, is known to hold. Then, while it is not possible to determine an optimal action, one can at least eliminate all actions that are dominated by some feasible action. The remaining subset of feasible actions will then be called, as usual, the efficient set.
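Eliminating dominated actions is a mechanical step, sketched below. The four actions and their expectation vectors are hypothetical; the dominance test follows the definition given above (at least as large in every expected criterion, strictly larger in at least one).

```python
def dominates(e_a, e_b):
    """True if expectation vector e_a dominates e_b."""
    return (all(x >= y for x, y in zip(e_a, e_b))
            and any(x > y for x, y in zip(e_a, e_b)))

def efficient_set(expectations):
    """Return the actions not dominated by any other feasible action.

    expectations: dict mapping an action label to its vector of
    expected criterion values.
    """
    return {a for a, e_a in expectations.items()
            if not any(dominates(e_b, e_a)
                       for b, e_b in expectations.items() if b != a)}

# Hypothetical expected criterion values for four feasible actions.
example = {"a": [3.0, 1.0], "b": [2.0, 1.0], "c": [1.0, 4.0], "d": [1.0, 3.5]}
efficient = efficient_set(example)
print(sorted(efficient))
```

In this example b is dominated by a and d by c, while a and c do not dominate each other, so both remain in the efficient set. Choosing between them would require knowing the utility function itself.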


Consider now the case

$$u = v(r) = r_1 + r_2 + \dots + r_n;$$

then by (A.1),

$$E_a(u) = \sum_{r_1,\dots,r_n} r_1\,\pi_a(r_1,\dots,r_n) + \dots + \sum_{r_1,\dots,r_n} r_n\,\pi_a(r_1,\dots,r_n);$$

then by (A.2)

$$E_a(u) = \sum_{r_1} r_1\,\pi_a(r_1) + \dots + \sum_{r_n} r_n\,\pi_a(r_n) = E_a(r_1) + \dots + E_a(r_n),$$


an obvious result ("Expectation of sum = sum of expectations"). We shall now prove

Theorem. Expected utility is monotone in expected criterion values if and only if utility is linear in the criteria.


Clearly, the conclusion of this theorem ("the expected utility is monotone in expected criterion values") could be replaced by the following equivalent propositions:

(i) "If action a dominates action b with respect to expected criterion values then a is preferred to b";

(ii) "The efficient set consists of all those feasible actions which are not dominated, with respect to expected criterion values, by any feasible action";

(iii) "An action that maximizes, over the set of feasible actions, the expected value of each criterion, is optimal."

By substituting any of these three sentences for the conclusion of the Theorem, we obtain three theorems equivalent to it.
The "if" part of the Theorem is obvious since a sum is a monotone increasing function of its components. It is unfortunate that the "only if" part is also true. For it follows that unless it is known that utility is additive the computation of expected criterion values loses much of its usefulness: an action b dominated by some other action a with respect to the expected criteria may still be preferable to a, and may indeed be optimal, unless of course some further conditions are known to exist [e.g., distributions $\pi_a(r), \pi_b(r), \dots$ yielded by all feasible actions are known to belong to some special class, Gaussian for example].


Then for every $i = 1,\dots,n$,

$$E_a(r_i) = \alpha r_i^0 + (1-\alpha)\,r_i',$$

$$E_b(r_i) = \bar r_i = \alpha r_i^0 + (1-\alpha)\,r_i' = E_a(r_i).$$

Hence $E_b(r_i) = E_a(r_i)$, all $i$.

On the other hand,

$$E_a(u) = \alpha\,v(r^0) + (1-\alpha)\,v(r'),$$

$$E_b(u) = v(\alpha r^0 + (1-\alpha)\,r').$$


Suppose expected utility of any action is monotone in the expected criterion values (resulting from that action). Then, since $E_a(r_i) = E_b(r_i)$ for all $i$, we must have

$$E_a(u) = E_b(u),$$

$$\alpha\,v(r^0) + (1-\alpha)\,v(r') = v(\alpha r^0 + (1-\alpha)\,r').$$

This is possible only if the function $v$ on the space of vectors $r$ is linear, i.e., if there exist $w_i$ $(i = 0, 1, \dots, n)$ such that

$$v(r_1,\dots,r_n) = w_0 + \sum_{i=1}^{n} w_i r_i.$$

It is then often said that the criteria are "commensurable" (among each other and with utility itself). A most common case is to convert them into dollars, under the (sometimes tacit) assumption that utility is linear in dollars.
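The "only if" argument can be illustrated with numbers. The sketch below assumes a reading of the construction in which action a yields the result vector r° with probability α and r' with probability 1 − α, while action b yields their mixture with certainty; since that setup is not spelled out in the surviving text, it is an assumption of the example. The utility sqrt(r1) + sqrt(r2) and all numerical values are likewise hypothetical.

```python
import math

def expected_utility(lottery, v):
    """lottery: list of (result_vector, probability) pairs."""
    return sum(p * v(r) for r, p in lottery)

def expectation_vector(lottery):
    n = len(lottery[0][0])
    return [sum(p * r[i] for r, p in lottery) for i in range(n)]

def v_nonlinear(r):
    # Assumed utility: increasing in each criterion, but not linear.
    return math.sqrt(r[0]) + math.sqrt(r[1])

def v_linear(r):
    # Assumed linear utility w0 + w1*r1 + w2*r2 with hypothetical weights.
    return 1.0 + 2.0 * r[0] + 3.0 * r[1]

alpha = 0.5
r0, r1 = (0.0, 0.0), (16.0, 16.0)
action_a = [(r0, alpha), (r1, 1.0 - alpha)]           # lottery over r0 and r1
mix = tuple(alpha * x + (1.0 - alpha) * y for x, y in zip(r0, r1))
action_b = [(mix, 1.0)]                               # the mixture, for certain
action_c = [((7.0, 7.0), 1.0)]                        # dominated by a in expectations

same_expectations = expectation_vector(action_a) == expectation_vector(action_b)
gap_nonlinear = expected_utility(action_b, v_nonlinear) - expected_utility(action_a, v_nonlinear)
gap_linear = expected_utility(action_b, v_linear) - expected_utility(action_a, v_linear)
c_preferred_to_a = (expected_utility(action_c, v_nonlinear)
                    > expected_utility(action_a, v_nonlinear))

print(same_expectations, gap_nonlinear, gap_linear, c_preferred_to_a)
```

Under the nonlinear utility, a and b have identical expected criterion values yet different expected utilities, and the sure action c is preferred to a even though a dominates it with respect to expected criterion values. Under the linear utility the gap between a and b vanishes, as the theorem requires.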
Regions with shaded boundaries are: one consisting of inquiries more informative than η, and one consisting of inquiries less informative than η.

FIGURE 1. Informativeness of binary inquiries.

Given the decider's characteristics, the region with shaded boundaries consists of useless inquiries. The remaining region of the half-square above the main diagonal consists of useful inquiries. It contains indifference lines, all parallel. η is the same point as in Figure 1.

FIGURE 2. Values of binary inquiries in the case of 2 actions.

FIGURE 5. Inquiring, Communicating, Deciding. (Diagram labels include the benefit function and the cost functions γ of the components.)
The economically minded user must consider the several system components jointly; and it turns out that, in certain important cases, the average difference between the benefit and cost to a user is maximized by large-scale demand. Moreover, the aggregate demand of all users will depend on the joint supply conditions for the various system components. It will thus depend, for example, on the cost economies due to the "packaging" of several components, to standardization and large-scale production. This opens up the question whether social interest is best served by a competitive market in information processing equipment and services, human as well as inanimate.